Abstract

Bayesian inference is attractive for its coherence and good frequentist properties. However, it is a common experience that eliciting an honest prior may be difficult and, in practice,
people often take an empirical Bayes approach, plugging empirical estimates of the prior 
hyperparameters into the posterior distribution. Even if not rigorously justified, the under- 
lying idea is that, when the sample size is large, empirical Bayes leads to "similar" inferential 
answers. Yet, precise mathematical results seem to be missing. In this work, we give a more 
rigorous justification in terms of merging of Bayes and empirical Bayes posterior distribu- 
tions. We consider two notions of merging: Bayesian weak merging and frequentist merging 
in total variation. Since weak merging is related to consistency, we provide sufficient con- 
ditions for consistency of empirical Bayes posteriors. Also, we show that, under regularity 
conditions, the empirical Bayes procedure asymptotically selects the value of the hyperpa- 
rameter for which the prior mostly favors the "truth". Examples include empirical Bayes 
density estimation with Dirichlet process mixtures.

Keywords: Consistency, Bayesian weak merging, Frequentist strong merging, Maximum 
marginal likelihood estimate, Dirichlet process mixtures, g-priors.



1 Introduction and motivation 

The Bayesian approach to inference is appealing in treating uncertainty probabilistically through 
conditional distributions. If $(X_1, \dots, X_n) \mid \theta$ have joint density $p_\theta^{(n)}$ and $\theta$ has prior density $\pi(\theta \mid \lambda)$,
then the information on $\theta$, given the data, is expressed through the conditional, or posterior,
density $\pi(\theta \mid \lambda, x_1, \dots, x_n) \propto p_\theta^{(n)}(x_1, \dots, x_n)\, \pi(\theta \mid \lambda)$. Although Bayes procedures are increasingly
popular, it is a common experience that expressing honest prior information can be difficult and,
in practice, one is tempted to use some estimate $\hat\lambda_n = \hat\lambda_n(x_1, \dots, x_n)$ of the prior hyperparameter
$\lambda$ and a posterior distribution $\pi(\cdot \mid \hat\lambda_n, x_1, \dots, x_n)$. This mixed approach is usually referred to
as empirical Bayes in the literature (see, e.g., Lehmann and Casella (1998)) and we shall also
refer to it as empirical Bayes (EB) in this paper. The underlying idea is that, when $n$ is large,
EB should reasonably lead to inferential results similar to those of any Bayes procedure. Thus, 
an empirical Bayesian would achieve the goal of inference without completely specifying a prior. 
From a Bayesian point of view, when there is uncertainty on $\lambda$, a cleaner solution than EB would
be to put a hyperprior on $\lambda$. However, using a hierarchical prior is often computationally heavy,
thus EB appears as an attractive alternative that is expected to be "better" than fixing a "wrong"
value of the hyperparameter. Recently, Scott and Berger (2010) showed that, in the special case
of variable selection in regression models, EB procedures could have very strange features and 
rather pathological behaviours. They showed that if a Bayesian approach is based on a prior on 
models of the form 

$$\Pi(M_\gamma \mid p) \propto p^{k_\gamma} (1 - p)^{m - k_\gamma}, \qquad p \in (0, 1),$$

where $M_\gamma$ denotes the model associated with the vector of inclusions $\gamma \in \{0, 1\}^m$, $m$ being the
total number of candidate covariates, and $k_\gamma$ corresponds to the number of covariates used in
the model $M_\gamma$, then the maximum marginal likelihood estimator (marginal MLE) for $p$ can be
equal to 0 or 1 with positive probability. This corresponds to a very degenerate EB prior on the
distribution of models. It might still lead to interesting pointwise estimates of the model or of
the whole parameter; however, in terms of the posterior distribution, it is far from satisfactory.
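To make the boundary phenomenon concrete, the following is a minimal sketch, not taken from Scott and Berger (2010). It assumes that each model $M_\gamma$ comes with a marginal likelihood of the data (the numbers in `evidence` are purely hypothetical) and maximizes the resulting marginal likelihood of $p$ over a grid. When the null model dominates, the maximizer sits at the boundary $p = 0$.

```python
from itertools import product


def marginal_likelihood_of_p(p, model_evidence, m):
    """Sum over inclusion vectors gamma of p^k (1-p)^(m-k) * evidence[gamma]."""
    total = 0.0
    for gamma in product((0, 1), repeat=m):
        k = sum(gamma)  # number of covariates included in the model
        total += p**k * (1.0 - p) ** (m - k) * model_evidence[gamma]
    return total


def marginal_mle_of_p(model_evidence, m, grid_size=1001):
    """Grid search for the maximizer of the marginal likelihood over [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda p: marginal_likelihood_of_p(p, model_evidence, m))


# Hypothetical per-model marginal likelihoods for m = 2 covariates;
# the null model (0, 0) is given the largest value.
evidence = {(0, 0): 10.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 0.5}
p_hat = marginal_mle_of_p(evidence, m=2)
print(p_hat)
```

With these illustrative values the marginal likelihood is the convex quadratic $10 - 18p + 8.5p^2$, so its maximum over $[0, 1]$ is attained at an endpoint, here $p = 0$, and the EB prior puts all its mass on the null model.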

In this paper, we try to shed light on such phenomena by answering the following question: 
is it actually true that an EB procedure is asymptotically close to a regular Bayes procedure? 
Of course, it depends on what close here means. So, first we consider a weak notion of closeness 
which mainly corresponds to consistency and is still satisfied in the above pathological example. 
Then, we give a generic explanation of the above phenomenon, describing when and why marginal 
MLE EB procedures will be pathological or, on the contrary, when and why they will have some 
quality of optimality. 

We start by formalizing the asymptotic comparison in terms of merging of Bayes and EB 
procedures. We consider two notions of merging. First, we study Bayesian weak merging in 
the sense of Diaconis and Freedman (1986), which means that any Bayesian is sure that her/his
posterior and the EB posterior will eventually be close, in the sense of weak convergence. It thus
appears as a minimal requirement in order for the EB procedure to be reasonable from a Bayesian
viewpoint. Then, we study frequentist strong merging in the sense of Ghosh and Ramamoorthi
(2003), which means that the Bayes and EB posterior distributions will be eventually close, in
the total variation distance, $P_{\theta_0}^\infty$-almost surely, where $P_{\theta_0}^\infty$ denotes the probability law of
$(X_i)_{i \ge 1}$ under $\theta_0$, the true value, in the frequentist sense, of the parameter $\theta$, usually referred to
as the true distribution. It is worth noting that, when strong merging holds, if the Bernstein-von
Mises theorem holds in the $L_1$-sense for the Bayes posterior, then it also holds for the EB posterior. We
prove that, under some conditions, Bayesian and frequentist merging of Bayes and EB procedures 
hold. However, these results require conditions, possibly more restrictive than expected. Thus, 
a further contribution of this work is to identify some typical situations wherein merging fails, 
which, despite appearances, are not surprising. These results have therefore the practical interest 
of characterizing, at least in the parametric case, those families of priors to be used with regard 
to EB procedures and those to be avoided, in particular if one is not merely interested in point 
estimation, but in some more general characteristics of the posterior distribution such as credible 
regions. 
Developing from Diaconis and Freedman (1986), we see (Section 2) that weak merging of
Bayes and EB posterior distributions holds if and only if the EB posterior is weakly consistent,
in the frequentist sense, at $\theta_0$, for every $\theta_0$ in the parameter space $\Theta$. Thus, conditions for
weak consistency of EB posteriors are needed. This provides a Bayesian motivation for studying 
frequentist consistency of EB posteriors. Consistency is, however, of autonomous interest as well 
and we consider it in a general context, beyond the case of exchangeable (or i.i.d. $\sim P_\theta$) data.
We provide sufficient conditions for consistency of EB posteriors that cover both parametric 
and nonparametric cases. In fact, an EB approach is even more tempting in nonparametric 
problems, since frequentist properties of Bayes procedures are known to crucially depend on
the fine details of the prior (Diaconis and Freedman, 1986) and on a careful choice of the prior
hyperparameters. We exhibit examples to illustrate EB consistency of Dirichlet process mixtures 
which is a commonly used nonparametric prior. Conditions for EB consistency seem stronger 
than those needed for consistency of Bayes posteriors: although for Bayes posteriors the so- 
called Kullback-Leibler prior support condition implies weak consistency, it is not the case for 
EB posteriors. Thus, to validate EB posteriors we require stronger conditions and provide a 
counterexample where the EB posterior is not weakly consistent, despite validity of the Kullback- 
Leibler prior support condition. 

Even when consistency and weak merging hold, simple examples show that the EB posterior 
can have unexpected and counterintuitive behaviors. Frequentist strong merging is a way to 
refine the analysis, making it possible to detect differences between Bayes and EB posterior distributions,
even at first-order asymptotics. Obtaining strong merging of Bayes posteriors in nonparametric
contexts is often impossible since pairs of priors are typically singular. Thus, in tackling this issue,
we concentrate on parametric models and on the specific, but important, case of the marginal
MLE $\hat\lambda_n$. In this setup, we find that the behavior of the EB posterior is essentially driven by
the behavior of the prior at $\theta_0$. Roughly speaking, if $\sup_{\lambda \in \Lambda} \pi(\theta_0 \mid \lambda) = \infty$, then the EB posterior
will tend to concentrate "too much" around $\theta_0$ to merge with any posterior associated with a
positive and continuous (at $\theta_0$) prior density. We illustrate this behavior in Bayesian regression
with g-priors. Conversely, if $\sup_{\lambda \in \Lambda} \pi(\theta_0 \mid \lambda) < \infty$, then frequentist strong merging holds and
$\hat\lambda_n$ converges to the set of values for which such supremum is attained, say $\lambda_0$, assuming here
uniqueness for ease of exposition. This value can be understood as the prior oracle as it is the
value of $\lambda$ that mostly favors $\theta_0$. In other terms, $\lambda_0$ is the value which allows one to make the most
precise statements a priori on $\theta_0$. In this respect, the EB posterior achieves some kind of
optimality. An open issue is whether $\lambda_0$ is also optimal with respect to some other criteria. We
briefly discuss this and related open issues at the end of the paper. 

We underline that, in the literature, the term "empirical Bayes" is used with different mean- 
ings. One is that herein considered; yet, people also refer to empirical Bayes in problems where 
a "prior" is introduced, but some frequentist interpretation of it is possible, typically in mixture 
models where $X_i \mid \lambda \overset{\text{iid}}{\sim} \int_\Theta p_\theta(\cdot)\, dG(\theta \mid \lambda)$. In this case, EB corresponds to maximum likelihood
estimation of $\lambda$, i.e., MLE of the mixing distribution. A Bayesian approach to these problems
would assign a prior distribution on $\lambda$. In this case, a comparison between Bayes and EB reduces
to the interesting, but more standard, comparison between MLE and Bayes procedures and it is
not studied in this note.

1.1 General context and notation 

Let $\mathbb{X}$ and $\Theta$ denote the observational space and the parameter space, respectively. They can be
very general: we only require that they are complete and separable metric spaces, equipped with
their Borel $\sigma$-fields. Hereafter, any parameter space will follow the same assumption and, for
such a space $\Theta$, the Borel $\sigma$-field is denoted by $\mathcal{B}(\Theta)$. Throughout the paper, unless otherwise
stated, the same symbol $d$ is used to denote any metric (possibly semi-metric). Let $(X_i)_{i \ge 1}$ be
a sequence of random elements, with the $X_i$'s taking values in $\mathbb{X}$. Suppose that the probability
measure of $(X_i)_{i \ge 1}$ is modeled as a family $\{P_\theta^\infty : \theta \in \Theta\}$. We denote by $P_\theta^{(n)}$ the joint probability
law of $(X_1, \dots, X_n)$. We assume that $P_\theta^{(n)}$ is dominated by a common $\sigma$-finite measure $\mu$ (in
principle, the dominating measure may depend on $n$, but we drop the $n$ to simplify the notation)
and denote by $p_\theta^{(n)}$ the density of $P_\theta^{(n)}$ w.r.t. $\mu$.

Let $\{\Pi(\cdot \mid \lambda) : \lambda \in \Lambda\}$ be a family of prior probability distributions on $\Theta$. Given a prior
$\Pi(\cdot \mid \lambda)$, we denote by $\Pi(\cdot \mid \lambda, X_1, \dots, X_n)$ the corresponding posterior distribution of $\theta$, given
$(X_1, \dots, X_n)$. Since the model is dominated, the natural version of this conditional distribution
is given by Bayes' rule. In the sequel, we use the short notation $X_{1:n} := (X_1, \dots, X_n)$,
$x_{1:n} := (x_1, \dots, x_n)$ and $x^\infty := (x_1, x_2, \dots)$.

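The EB approach consists in estimating the hyperparameter $\lambda$ by $\hat\lambda_n = \hat\lambda_n(X_{1:n})$ and plugging
the estimate into the posterior distribution. In general, $\hat\lambda_n$ takes values in the closure $\bar\Lambda$ of
$\Lambda$. For $\lambda_0$ in the boundary $\partial\Lambda$ of $\Lambda$, we define $\Pi(\cdot \mid \lambda_0)$ as the weak limit of $\Pi(\cdot \mid \lambda)$ for $\lambda \to \lambda_0$,
when it exists. We use the notation $P_n \Rightarrow P$ to mean that $P_n$ converges weakly to $P$, for any
probabilities $P_n$, $P$. We shall say that the EB posterior is well defined if $\Pi(\cdot \mid \hat\lambda_n, X_{1:n})$ is a
probability measure for all large $n$, $P_\theta^\infty$-almost surely, for all $\theta$. Then, the EB posterior is defined
as

$$\Pi(B \mid \hat\lambda_n, X_{1:n}) = \frac{\int_B p_\theta^{(n)}(X_{1:n})\, d\Pi(\theta \mid \hat\lambda_n)}{\int_\Theta p_\theta^{(n)}(X_{1:n})\, d\Pi(\theta \mid \hat\lambda_n)}, \qquad B \in \mathcal{B}(\Theta).$$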

In what follows, the EB posterior corresponding to the estimator $\hat\lambda_n$ will also be denoted by
$\Pi_{n, \hat\lambda_n}$. Throughout the paper, the EB posterior is always tacitly assumed to be well defined.
Many types of estimators $\hat\lambda_n$ can be considered: in particular, the marginal MLE, defined as

$$\hat\lambda_n \in \operatorname*{argmax}_{\lambda \in \bar\Lambda} m(X_{1:n} \mid \lambda), \qquad \text{where } m(X_{1:n} \mid \lambda) := \int_\Theta p_\theta^{(n)}(X_{1:n})\, d\Pi(\theta \mid \lambda),$$

is the most popular. Whenever we consider the marginal MLE, we assume that $\sup_{\lambda \in \bar\Lambda} m(X_{1:n} \mid \lambda) < \infty$
for all large $n$, $P_\theta^\infty$-almost surely, for all $\theta$, and write $\hat m(X_{1:n}) := m(X_{1:n} \mid \hat\lambda_n)$. We shall present
general results for the EB posterior without specifying the type of estimator $\hat\lambda_n$ considered, as
well as specific results for the marginal MLE.
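The two-step definition above (maximize the marginal likelihood over the hyperparameter, then apply Bayes' rule with the selected prior) can be sketched on a finite toy problem. Everything below is an illustrative assumption rather than part of the paper: a Bernoulli model, a five-point parameter grid, and a family of three priors labelled by hypothetical names.

```python
def bernoulli_likelihood(theta, xs):
    """Joint density of i.i.d. Bernoulli(theta) observations xs."""
    s = sum(xs)
    return theta**s * (1.0 - theta) ** (len(xs) - s)


def marginal_likelihood(xs, thetas, prior):
    """m(x | lambda): the likelihood averaged over the prior weights."""
    return sum(bernoulli_likelihood(t, xs) * w for t, w in zip(thetas, prior))


def eb_posterior(xs, thetas, prior_family):
    """Return the marginal MLE of lambda and the plug-in (EB) posterior."""
    lam_hat = max(
        prior_family,
        key=lambda lam: marginal_likelihood(xs, thetas, prior_family[lam]),
    )
    prior = prior_family[lam_hat]
    weights = [bernoulli_likelihood(t, xs) * w for t, w in zip(thetas, prior)]
    norm = sum(weights)  # equals m(x | lam_hat)
    return lam_hat, [w / norm for w in weights]


thetas = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical finite parameter space
prior_family = {                    # hypothetical family {Pi(. | lambda)}
    "low": [0.4, 0.3, 0.15, 0.1, 0.05],
    "flat": [0.2, 0.2, 0.2, 0.2, 0.2],
    "high": [0.05, 0.1, 0.15, 0.3, 0.4],
}
xs = [1, 1, 0, 1, 1, 1, 0, 1]       # 6 successes out of 8
lam_hat, post = eb_posterior(xs, thetas, prior_family)
print(lam_hat, post)
```

In this toy run the data favor large values of the parameter, so the marginal MLE selects the prior that puts most mass there; the normalizing constant of the EB posterior is exactly the maximized marginal likelihood.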

The paper is organized as follows. In Section 2, we study Bayesian merging and consistency
of EB posteriors. Specifically, in subsection 2.1, we begin by studying weak merging of Bayes
and EB procedures for exchangeable sequences. In subsection 2.2, we give sufficient conditions
for consistency of the EB posterior for non-i.i.d. data. Parametric and nonparametric examples
are provided in subsection 2.3. In Section 3, we study frequentist strong merging and obtain, as
a by-product, that, when strong merging takes place, the EB procedure leads to an oracle choice
of the prior hyperparameter. Situations where strong merging fails are illustrated in Bayesian
regression with g-priors in subsection 3.3. Some open issues are discussed in the final section.



2 Bayesian weak merging and consistency 



We begin by studying Bayesian weak merging in the sense of Diaconis and Freedman (1986) of
Bayes and EB procedures in the basic case of exchangeable data. Since Bayesian merging is 
intimately related to consistency, we then give conditions for consistency of the EB posterior in 
the general case. 



2.1 Bayesian merging of Bayes and empirical Bayes inferences 

Merging of Bayes and EB inferences formalizes the idea that two posteriors or, in a predictive 
setting, two predictive distributions of all future observations given the past, will eventually
be close. A well-known result by Blackwell and Dubins (1962) establishes that, for $P$ and $Q$
probability laws of a process $(X_i)_{i \ge 1}$, if $P$ and $Q$ are mutually absolutely continuous, then there
is strong merging of the predictions of future events, given the increasing information provided
by the data $X_{1:n}$. For exchangeable $P$ and $Q$ corresponding to priors $\Pi$ and $q$, respectively, $P$
and $Q$ are mutually absolutely continuous if and only if $\Pi$ and $q$ are such. However, the EB
approach only gives a sequence of "posterior distributions" $\Pi_{n, \hat\lambda_n}$, without having a properly
defined probability law of $(X_i)_{i \ge 1}$. Thus, the above result on strong merging does not apply.
Furthermore, Blackwell and Dubins' result does not apply when the priors are singular, as is
often the case in nonparametric problems.



Diaconis and Freedman (1986) gave a notion of weak merging that applies even when strong
merging does not. They showed that two Bayesians with different priors will merge weakly if
and only if one Bayesian has a weakly consistent posterior, in the frequentist sense, at $\theta$, for every
$\theta \in \Theta$. We show that an analogous result holds for Bayes and EB: in fact, EB merges with any
true Bayes procedure on $\Theta$ if and only if the empirical Bayesian has a posterior that is weakly consistent at $\theta$, for
every $\theta \in \Theta$. Recall that two sequences of probability measures $p_n$ and $q_n$ are said to merge
weakly if and only if $|\int g\, dp_n - \int g\, dq_n| \to 0$ for all continuous and bounded functions $g$.
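As a small numerical illustration of this definition, the sketch below tracks $|\int g\, dp_n - \int g\, dq_n|$ for the single test function $g(\theta) = \theta$ when $p_n$ and $q_n$ are Beta posteriors arising from two different Beta priors under Bernoulli sampling. The priors, the success fraction 0.3 and the sample sizes are assumptions chosen for illustration; weak merging requires the gap to vanish for every bounded continuous $g$, of which this checks only one.

```python
def beta_posterior_mean(a, b, successes, n):
    """Posterior mean of theta under a Beta(a, b) prior and n Bernoulli trials."""
    return (a + successes) / (a + b + n)


def merging_gap(prior_p, prior_q, successes, n):
    """|int g dp_n - int g dq_n| for g(theta) = theta and two Beta priors."""
    mean_p = beta_posterior_mean(*prior_p, successes, n)
    mean_q = beta_posterior_mean(*prior_q, successes, n)
    return abs(mean_p - mean_q)


sample_sizes = (10, 100, 1000, 10000)
gaps = []
for n in sample_sizes:
    successes = round(0.3 * n)  # hypothetical data with 30% successes
    gaps.append(merging_gap((1, 1), (20, 2), successes, n))
print(list(zip(sample_sizes, gaps)))
```

The gap between the two posterior means shrinks as $n$ grows, even though the second prior is strongly informative, which is the behaviour that weak merging formalizes.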

The results are herein restricted to the case of exchangeable sequences: given $\theta$, the
$X_i$'s are independent and identically distributed (i.i.d.) with common distribution $P_\theta$. We
denote by $P_\theta^\infty$ the corresponding infinite product measure on $\mathbb{X}^\infty$. For any prior $\Pi$ on $\Theta$, the
joint probability law of the parameter and the data is given by

$$P_\Pi(A \times B) := \int_A P_\theta^\infty(B)\, d\Pi(\theta) \quad \text{for all Borel sets } A \subset \Theta \text{ and } B \subset \mathbb{X}^\infty.$$
We denote by $\Pi_n$ the posterior distribution of $\theta$, given $X_{1:n}$. Note that here we do not need
to assume a dominated model $\{P_\theta : \theta \in \Theta\}$, as long as there exists a natural version of the
posterior distribution. We also consider the "predictive" distribution of all future observations
$X_{n+1}, X_{n+2}, \dots$, conditionally on $X_{1:n}$, given by

$$P_{\Pi_n}(\cdot) := \int_\Theta P_\theta^\infty(\cdot)\, d\Pi_n(\theta).$$

A posterior distribution $\Pi_n$ is weakly consistent at $\theta$ if, for any weak neighborhood $W$ of $\theta$,
$\Pi_n(W^c) \to 0$ a.s. $[P_\theta^\infty]$; it is weakly consistent if this holds for all $\theta \in \Theta$.

Let us now consider the EB posterior $\Pi_{n, \hat\lambda_n}$ described in subsection 1.1. The following result
is a consequence of Theorem A.1 in Diaconis and Freedman (1986).



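Proposition 2.1. Let the map $\theta \mapsto P_\theta$ be one-to-one and Borel. Given a family of priors
$\{\Pi(\cdot \mid \lambda) : \lambda \in \Lambda\}$, let $\Pi_{n, \hat\lambda_n}$ be the EB posterior. The EB posterior is consistent at any $\theta \in \Theta$ if
and only if, for any prior probability $q$ on $\Theta$, the EB posterior and the Bayes posterior $q_n$ weakly
merge, as $n \to \infty$, with $P_q$-probability 1.

Moreover, if the map $\theta \mapsto P_\theta$ is continuous and has continuous inverse, then the EB posterior
is consistent at any $\theta$ if and only if, for all priors $q$ on $\Theta$, the predictive distributions $P_{\Pi_{n, \hat\lambda_n}}$ and
$P_{q_n}$ weakly merge, with $P_q$-probability 1, that is,

$$\forall f \in C_B(\mathbb{X}^\infty), \qquad \Big| \int_{\mathbb{X}^\infty} f \, d\big(P_{\Pi_{n, \hat\lambda_n}} - P_{q_n}\big) \Big| \to 0 \quad \text{a.s. } [P_q].$$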



Proof. It suffices to note that the proof for the equivalences (i)-(iv) in Theorem A.1 of
Diaconis and Freedman (1986), page 18, goes through to the present case: in fact, it is based on the properties of the
Bayes posterior $q_n$, which hold true in this case, whereas, for the EB posterior $\Pi_{n, \hat\lambda_n}$, only
consistency is required. □

Proposition 2.1 shows that any Bayesian is sure that her/his estimate w.r.t. the quadratic
loss of any continuous and bounded function $g$ will asymptotically agree with the EB estimate
if and only if the EB posterior is consistent at any $\theta$. If so, in particular, a Bayesian with prior
$\Pi(\cdot \mid \lambda)$ is sure that $|\int g(\theta)\, d\Pi(\theta \mid \hat\lambda_n, X_{1:n}) - \int g(\theta)\, d\Pi(\theta \mid \lambda, X_{1:n})| \to 0$.



2.2 Consistency of empirical Bayes posteriors 

Weak merging gives a Bayesian motivation for consistency as a conditio sine qua non for intersubjective
agreement, as seen in Proposition 2.1. Of course, consistency is a basic property in
itself from a frequentist viewpoint. Therefore, it is of interest to study consistency of the EB posterior
distribution in greater generality, beyond the case of i.i.d. (or exchangeable) observations.
Consistency of parametric and nonparametric Bayesian procedures is fairly well understood.
However, sufficient conditions for consistency of a Bayesian posterior are not enough for consistency
of the EB posterior, since, in the latter case, the EB prior is data-dependent through $\hat\lambda_n$.
We give two results on consistency of EB posteriors that hold true for dependent sequences and
cover both parametric and nonparametric cases. To be more specific, let $(\Theta, d)$ be a semi-metric
space. For any $\epsilon > 0$, let $U_\epsilon = U_\epsilon(\theta_0) := \{\theta \in \Theta : d(\theta, \theta_0) < \epsilon\}$ denote the open ball centered at
$\theta_0$ with radius $\epsilon$. Note that $d(\cdot, \cdot)$ is understood as a loss function and, depending on the model,
it can be a distance directly on $\Theta$, for instance $(\theta - \theta')^2$ if $\theta$ is real, or a (pseudo-)distance on
the model $p_\theta^{(n)}$, for instance the Hellinger distance between densities in a density model or the
$L_2$-norm between regression functions in a regression model. We provide sufficient conditions so
that, for any $\epsilon > 0$, the posterior probability

$$\Pi(U_\epsilon^c \mid \hat\lambda_n, X_{1:n}) \to 0 \quad \text{a.s. } [P_{\theta_0}^\infty],$$

where $P_{\theta_0}^\infty$ denotes the probability measure of $(X_i)_{i \ge 1}$, under $\theta_0$.



2.2.1 Sufficient conditions for consistency of empirical Bayes posteriors 

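Recall that the model $P_\theta^{(n)}$ is dominated by a $\sigma$-finite measure $\mu$ and has density $p_\theta^{(n)}$ w.r.t. $\mu$.
For $\theta \in \Theta$, let

$$R(p_\theta^{(n)}) := \frac{p_\theta^{(n)}}{p_{\theta_0}^{(n)}}(X_{1:n})$$

denote the likelihood ratio. We shall use the following assumptions.

(A1) There exist constants $c_1, c_2 > 0$ such that, for any $\epsilon > 0$,

$$P_{\theta_0}^{*}\Big( \sup_{\theta \in U_\epsilon^c} R(p_\theta^{(n)}) \ge e^{-c_1 n \epsilon^2} \Big) \le c_2 (n \epsilon^2)^{-(1+t)}$$

for some $t > 0$, where $P_{\theta_0}^{*}$ denotes the outer measure.

(A2) For each $\theta_0 \in \Theta$, there exists $\lambda_0 \in \Lambda$ such that, for any $\eta > 0$,

$$\Pi(B_{KL}(\theta_0; \eta) \mid \lambda_0) > 0,$$

where, for $KL_\infty(\theta_0; \theta) := -\lim_{n \to \infty} n^{-1} \log R(p_\theta^{(n)})$, the set $B_{KL}(\theta_0; \eta) := \{\theta \in \Theta : KL_\infty(\theta_0; \theta) < \eta\}$.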

When compared to the assumptions usually considered for posterior consistency, (A1) is quite
strong; it is, however, a common assumption in the maximum likelihood estimation literature.
It is verified in most parametric models, see, e.g., Schervish (1995), and also in nonparametric
models. For instance, Wong and Shen (1995) proved that, for i.i.d. observations with density
$f_\theta$, if $U_\epsilon$ is the Hellinger open ball centered at $f_0 := f_{\theta_0}$ with radius $\epsilon$, that is, $U_\epsilon = H_\epsilon(f_0) :=
\{f_\theta : h(f_\theta, f_0) < \epsilon\}$, a sufficient condition for (A1) to hold true is that there exist constants
$c_3, c_4 > 0$ such that, for each $\epsilon > 0$,

$$\int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} H_{[\,]}^{1/2}(u/c_3, \Theta, h)\, du \le c_4 \sqrt{n}\, \epsilon^2 \quad \text{for } n \text{ large enough}, \qquad (2.1)$$
where the function $H_{[\,]}(\cdot, \Theta, h)$ denotes the Hellinger bracketing metric entropy of $\Theta$.

In this paper, (A1) is used to handle the numerator of the ratio defining the EB posterior
probability of the complement of any neighborhood $U_\epsilon$ in the following way. By the first Borel-Cantelli lemma, (A1)
implies that $\sup_{\theta \in U_\epsilon^c} R(p_\theta^{(n)}) < e^{-c_1 n \epsilon^2}$ for all large $n$, a.s. $[P_{\theta_0}^\infty]$. Thus,

$$\int_{U_\epsilon^c} R(p_\theta^{(n)})\, d\Pi(\theta \mid \hat\lambda_n) \le \Pi(U_\epsilon^c \mid \hat\lambda_n) \sup_{\theta \in U_\epsilon^c} R(p_\theta^{(n)}) \le \sup_{\theta \in U_\epsilon^c} R(p_\theta^{(n)}) < e^{-c_1 n \epsilon^2} \quad \text{for all large } n, \qquad (2.2)$$

$P_{\theta_0}^\infty$-almost surely. Note that the bound in (2.2) is valid for any type of estimator $\hat\lambda_n$.

Assumption (A2) is the usual Kullback-Leibler prior support condition, herein required to
hold true for some value $\lambda_0 \in \Lambda$. It is a mild assumption considered in most results on posterior
consistency and has been shown to be satisfied for various models and families of priors. Note that
the rather abstract definition of $KL_\infty(\cdot; \cdot)$ is mainly considered to deal with the non-i.i.d. case.
In the i.i.d. case, $KL_\infty(\theta_0; \theta)$ is simply the Kullback-Leibler divergence between the densities
$p_{\theta_0}$ and $p_\theta$ (per observation). In the present context, it is used when $\hat\lambda_n$ is the marginal MLE
to bound from below $\hat m(X_{1:n}) / p_{\theta_0}^{(n)}(X_{1:n})$. For other types of estimator $\hat\lambda_n$, a variant of (A2) is
considered, cf. conditions of Proposition 2.3.

As explained in subsection 1.1, there are infinitely many possible choices for $\hat\lambda_n$. We first
study consistency of the EB posterior when $\hat\lambda_n$ is the marginal MLE, which is a common EB
approach, see, for instance, Berger (1985), George and Foster (2000), Scott and Berger (2010),
to name but a few. We then consider the case where $\hat\lambda_n$ is any estimator for which some
direct knowledge is available, another common EB approach, even without any explicit mention
of it being an EB procedure, see McAuliffe, Blei and Jordan (2006) for an example of plug-in
EB in a nonparametric setting.



2.2.2 Case of the maximum marginal likelihood estimator 

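Recall that $\hat\lambda_n \in \operatorname*{argmax}_{\lambda \in \bar\Lambda} m(X_{1:n} \mid \lambda)$ and $\hat m(X_{1:n}) < \infty$. We have the following result.

Proposition 2.2. Under assumptions (A1) and (A2), for any $\epsilon > 0$,

$$\Pi(U_\epsilon^c \mid \hat\lambda_n, X_{1:n}) \to 0 \quad \text{a.s. } [P_{\theta_0}^\infty].$$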

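Proof. Fix $\epsilon > 0$. Set $N_n := \int_{U_\epsilon^c} R(p_\theta^{(n)})\, d\Pi(\theta \mid \hat\lambda_n)$. Under (A1), by (2.2), $N_n < e^{-c_1 n \epsilon^2}$ for
all large $n$, $P_{\theta_0}^\infty$-almost surely. Let $D_n := \int_\Theta R(p_\theta^{(n)})\, d\Pi(\theta \mid \hat\lambda_n)$. By definition of $\hat m(X_{1:n})$, with
$P_{\theta_0}^\infty$-probability 1,

$$D_n = \frac{\hat m(X_{1:n})}{p_{\theta_0}^{(n)}(X_{1:n})} \ge \frac{m(X_{1:n} \mid \lambda_0)}{p_{\theta_0}^{(n)}(X_{1:n})} =: D_n(\lambda_0) \quad \text{for all large } n,$$

where $\lambda_0$ is as required in (A2). Reasoning as in Lemma 10 of Barron (1988), page 23, for
any $\eta > 0$, $D_n(\lambda_0) > e^{-n\eta}$ for all large $n$, $P_{\theta_0}^\infty$-almost surely. Choosing $0 < \eta < c_1 \epsilon^2$, for
$\delta := (c_1 \epsilon^2 - \eta) > 0$, we have $\Pi(U_\epsilon^c \mid \hat\lambda_n, X_{1:n}) = N_n / D_n \le N_n / D_n(\lambda_0) < e^{-n\delta}$ for all large $n$,
$P_{\theta_0}^\infty$-almost surely. The assertion follows. □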
Remark 2.1. Although it seems intuitive that the usual Kullback-Leibler condition on the
positivity of the prior mass of Kullback-Leibler neighborhoods of $P_{\theta_0}$, say (A2), implies weak
consistency of the EB posterior, as happens for the true posterior, this is not the case,
and additional assumptions on the behavior of the likelihood ratio and/or on the prior need to
be required, as illustrated in the following example. Consider Bahadur (1958)'s example, see
also Lehmann and Casella (1998), pages 445-447, and Ghosh and Ramamoorthi (2003), pages
29-31. Let $\Theta = \mathbb{N}^*$. For each $\theta = k$, a density $p_\theta$ on $[0, 1]$ is defined as follows. Let $a_0 = 1$ and
define $a_k$ recursively by $\int_{a_k}^{a_{k-1}} [h(x) - C]\, dx = 1 - C$, where $0 < C < 1$ is a given constant and
$h(x) = e^{1/x^2}$. Since $\int_0^1 e^{1/x^2}\, dx = \infty$, the $a_k$'s are uniquely determined and the sequence $a_k \to 0$
as $k \to \infty$. For $\theta \in \Theta$, define

$$p_\theta(x) = \begin{cases} h(x) & \text{if } a_\theta < x < a_{\theta - 1}, \\ C & \text{if } x \in [0, 1] \cap (a_\theta, a_{\theta - 1})^c, \\ 0 & \text{otherwise.} \end{cases}$$
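Let $X_1, \dots, X_n \mid \theta \overset{\text{iid}}{\sim} p_\theta$. The MLE $\hat\theta_n$ exists and tends to $\infty$ in probability, regardless of the
true value $\theta_0 = k_0$ of $\theta$. It is, therefore, inconsistent. On the other hand, $\Theta$ being countable, by
Doob's theorem, any proper prior on $\Theta$ leads to a consistent posterior at all $\theta \in \Theta$. Consider a
family of priors $\{\Pi(\cdot \mid \lambda) : \lambda \in \Lambda\}$ such that, for each $\theta$, there exists $\lambda \in \bar\Lambda$ for which $\Pi(\cdot \mid \lambda) = \delta_\theta$.
If $\lambda \in \partial\Lambda$, then $\Pi(\cdot \mid \lambda)$ is defined as the weak limit of any sequence $\Pi(\cdot \mid \lambda_j)$ for $\lambda_j \to \lambda$ (when it
exists). It is always possible to construct such a family of priors. For $\lambda = (m, \sigma)$, let

$$\Pi(1 \mid \lambda) := \Phi((1/2 - m)/\sigma) - \Phi((-1/2 - m)/\sigma),$$
$$\Pi(\theta \mid \lambda) := \Phi((\theta - 1/2 - m)/\sigma) - \Phi((\theta - 3/2 - m)/\sigma) + \Phi((-\theta + 3/2 - m)/\sigma) - \Phi((-\theta + 1/2 - m)/\sigma) \quad \text{for } \theta > 1,$$

where $\Phi$ is the cumulative distribution function of a standard Gaussian random variable. By
taking $m = \theta - 1$ and letting $\sigma \to 0$, we have as a limit the Dirac mass at $\theta$ because $\Pi(1 \mid \lambda) \to 0$
and $\Pi(\theta \mid \lambda) \to 1$. Thus, for each $k_0 \in \mathbb{N}^*$, by taking $m = k_0 - 1$ and letting $\sigma \to 0$, we have as a
limit the Dirac mass at $k_0$. Then, the EB posterior is the Dirac mass at the MLE $\hat\theta_n$, which is
inconsistent. To see this, note that

$$\forall \lambda \in \bar\Lambda, \quad m(X_{1:n} \mid \lambda) \le \prod_{i=1}^n p_{\hat\theta_n}(X_i) \quad \text{and} \quad \hat m(X_{1:n}) = m(X_{1:n} \mid (\hat\theta_n - 1, 0)) = \prod_{i=1}^n p_{\hat\theta_n}(X_i).$$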
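The prior family used in this remark can be checked numerically. The sketch below assumes the reading in which $\Pi(\theta \mid \lambda)$ is the $N(m, \sigma^2)$ mass of the interval $(\theta - 3/2, \theta - 1/2)$ together with its mirror image $(-\theta + 1/2, -\theta + 3/2)$, and $\Pi(1 \mid \lambda)$ is the mass of $(-1/2, 1/2)$; the function names are hypothetical. Under that reading the masses sum to one, and taking $m = \theta - 1$ with small $\sigma$ pushes essentially all mass onto $\theta$.

```python
import math


def Phi(z):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prior_mass(theta, m, sigma):
    """Pi(theta | lambda) for lambda = (m, sigma), theta in {1, 2, ...}."""
    if theta == 1:
        return Phi((0.5 - m) / sigma) - Phi((-0.5 - m) / sigma)
    # Mass of (theta - 3/2, theta - 1/2) under N(m, sigma^2).
    right = Phi((theta - 0.5 - m) / sigma) - Phi((theta - 1.5 - m) / sigma)
    # Mass of the mirrored interval (-theta + 1/2, -theta + 3/2).
    left = Phi((-theta + 1.5 - m) / sigma) - Phi((-theta + 0.5 - m) / sigma)
    return right + left


# The intervals tile the real line, so the masses should add up to one
# (the truncation at theta = 59 leaves a negligible tail here).
total = sum(prior_mass(k, m=2.0, sigma=1.0) for k in range(1, 60))

# With m = theta - 1 and a small sigma, the prior is close to a point mass.
mass_at_4 = prior_mass(4, m=3.0, sigma=0.01)
print(total, mass_at_4)
```

This only illustrates why the Dirac masses belong to the closure of the family; it does not reproduce the inconsistency of the MLE in Bahadur's example.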

Remark 2.2. Proposition 2.2 gives a result on consistency; however, replacing $\epsilon$ with $\epsilon_n$ in
(A1) and with $\tilde\epsilon_n$ in the stronger version of (A2) as found in Ghosal and van der Vaart (2007a),
$\bar\epsilon_n := (\epsilon_n \vee \tilde\epsilon_n)$ is an upper bound on the contraction rate for the EB posterior. On the other
hand, $\bar\epsilon_n$ is an upper bound also on the rate of convergence for the Bayes posterior $\Pi(\cdot \mid \lambda_0, X_{1:n})$.
Choosing the value $\lambda_0$ that leads to the best prior concentration rate $\tilde\epsilon_n$ results in the best
posterior rate $\bar\epsilon_n$, at least if $\epsilon_n \le \tilde\epsilon_n$, in which case, the EB procedure attains some kind of
optimality. We will make this kind of result precise in the parametric framework in Section 3.
2.2.3 Case of a convergent $\hat\lambda_n$

In some applications, $\hat\lambda_n$ is chosen to be a convenient statistic, like some moment estimator, so
that the prior is centered at a plausible value for the parameter. In such cases, $\hat\lambda_n$ often has
a known asymptotic behavior, which does not necessarily mean that the EB posterior should
have a stable behaviour, even if the prior has. In the following proposition, we give sufficient
conditions for consistency of the EB posterior in such situations. Suppose that the parameter is
split into $\theta = (\tau, \zeta)$, where $\tau \in T$ and $\zeta \in Z$ and, given $\lambda \in \Lambda \subseteq \mathbb{R}^r$, $\tau \sim \tilde\Pi(\cdot \mid \lambda)$ while $\zeta \sim \bar\Pi$. In
other words, the hyperparameter $\lambda$ only influences the prior distribution of $\tau$. The overall prior
is $\Pi(\cdot \mid \lambda) := \tilde\Pi(\cdot \mid \lambda) \times \bar\Pi(\cdot)$. Let $\theta_0 = (\tau_0, \zeta_0)$ be the true value of $\theta = (\tau, \zeta)$.

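Proposition 2.3. Let $\tilde\Pi(\cdot \mid \hat\lambda_n)$, $n = 1, 2, \dots$, and $\tilde\Pi(\cdot \mid \lambda_0)$ be probability measures on $\mathcal{B}(T)$.
Assume that (A1) is satisfied and

(i) $\tilde\Pi(\cdot \mid \hat\lambda_n) \Rightarrow \tilde\Pi(\cdot \mid \lambda_0)$ a.s. $[P_{\theta_0}^\infty]$,

(ii) for each $\eta > 0$, there exists a set $K_\eta \subseteq B_{KL}(\theta_0; \eta)$ such that $\Pi(K_\eta \mid \lambda_0) > 0$,

(iii) for each $x^\infty \in \mathbb{X}^\infty$ and any $\eta > 0$, the set

$$E_\eta^{x^\infty} := \Big\{ (\tau, \zeta) \in K_\eta : \ \frac{1}{n} \log \frac{p_{(\tau_0, \zeta_0)}^{(n)}}{p_{(\tau_n, \zeta)}^{(n)}}(x_{1:n}) \not\to KL_\infty((\tau_0, \zeta_0); (\tau, \zeta)) \ \text{for some } \tau_n \to \tau \Big\}$$

satisfies $E_\eta^{x^\infty} \in \mathcal{B}(T) \otimes \mathcal{B}(Z)$ and, for $P_{\theta_0}^\infty$-almost every $x^\infty \in \mathbb{X}^\infty$,

$$\Pi(E_\eta^{x^\infty} \mid \lambda_0) = 0. \qquad (2.3)$$

Then, for any $\epsilon > 0$,

$$\Pi(U_\epsilon^c \mid \hat\lambda_n, X_{1:n}) \to 0 \quad \text{a.s. } [P_{\theta_0}^\infty].$$

The proof of Proposition 2.3 is postponed to the Appendix.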

Remark 2.3. Condition (i) is a natural condition in those cases where $\hat\lambda_n$ is an explicit estimator
(as opposed to the marginal MLE) such as a moment type estimator, see Example 2, Gaussian
DP mixture - II, in subsection 2.3. Condition (ii) is the usual Kullback-Leibler prior support
condition, except for the fact that here it concerns the support of the limiting prior. Condition
(iii) is more unusual. If, in the definition of $E_\eta^{x^\infty}$, the $\tau_n$'s were fixed at $\tau$, then (iii) would
be a basic ergodic condition on the support of $\Pi(\cdot \mid \lambda_0)$, so the difficulty comes from obtaining
an ergodic theorem uniformly over neighborhoods of $\tau$. In the case of i.i.d. observations, the
following condition implies (iii). If

$$\forall \eta, \epsilon > 0, \ \forall \theta \in K_\eta, \ \exists \delta = \delta(\theta, \epsilon) > 0 : \quad E_{\theta_0}\Big[ \sup_{\theta' \in \Theta :\, d(\theta', \theta) < \delta} \Big| \log \frac{p_\theta}{p_{\theta'}}(X_1) \Big| \Big] < \frac{\epsilon}{2},$$

then standard SLLN arguments imply that there exists a set $\mathbb{X}_0^\infty \subseteq \mathbb{X}^\infty$, with $P_{\theta_0}^\infty(\mathbb{X}_0^\infty) = 1$,
such that, for each $x^\infty \in \mathbb{X}_0^\infty$, condition (2.3) is satisfied.

We now present some examples illustrating the above consistency results.
2.3 Examples

We begin by considering a parametric example. Even if very simple, this example is illuminating
since it illustrates the different types of behaviour of consistent EB posteriors to be expected
when $\hat\lambda_n$ is the marginal MLE. These phenomena are studied in greater generality in Section 3,
where it is shown that the behavior of the EB posterior and the marginal MLE is driven by the
behavior of the map $\lambda \mapsto \pi(\theta_0 \mid \lambda)$, where $\pi(\theta_0 \mid \lambda)$ is the prior density, given $\lambda$, evaluated at $\theta_0$.

Example 1: Parametric case 

Let $X_1, \dots, X_n \mid \theta \sim p_\theta^{(n)}$, $\theta \in \mathbb{R}$, with $\{p_\theta^{(n)} : \theta \in \mathbb{R}\}$ such that (A1) holds true. Let $\theta$ be given
a Gaussian prior distribution, with mean $m$ and variance $\tau^2$, $\theta \sim N(m, \tau^2)$. Consider the EB
posterior with the marginal MLE $\hat\lambda_n$ in the following three cases:

(1) $\tau^2$ is fixed and $\lambda = m$ is estimated,

(2) $m$ is fixed and $\lambda = \tau^2$ is estimated,

(3) $\lambda = (m, \tau^2)$ and both parameters are estimated.

Interestingly, the behavior of the EB posterior $\Pi(\cdot \mid \hat\lambda_n, X_{1:n})$ is quite different in the three cases.
As an illustration, we consider the simple case where $X_i \mid \theta \overset{\text{iid}}{\sim} N(\theta, \sigma^2)$, with $\sigma^2$ known, which
satisfies (A1). Indeed, the choice of the sampling model is of little consequence to the asymptotic
behaviour of $\hat\lambda_n$.

Case (1). The posterior distribution of $\theta$ corresponding to a fixed value $\tau^2$ is $N(m_n, (1/\tau^2 +
n/\sigma^2)^{-1})$, where

$$m_n := \frac{\sigma^2/n}{\sigma^2/n + \tau^2}\, m + \frac{\tau^2}{\tau^2 + \sigma^2/n}\, \bar X_n.$$

The EB posterior is obtained by plugging in the marginal MLE, $\hat\lambda_n = \bar X_n$, and it is $N(\bar X_n, (1/\tau^2 +
n/\sigma^2)^{-1})$, which has a completely regular density. Both posteriors can be seen to be consistent
by direct computations.

It is worth making a comparison with the hierarchical Bayes approach, which typically assigns a
normal hyperprior on $\lambda = m$, that is, $\lambda \sim N(\lambda_0, \tau_0^2)$, so that $\Pi_h(\theta) = N(\theta \mid \lambda_0, \tau^2 + \tau_0^2)$. Note
that the hierarchical prior increases the uncertainty on $\theta$. The posterior is $N(m_n^h, (1/(\tau^2 +
\tau_0^2) + n/\sigma^2)^{-1})$, where $m_n^h$ has the same expression as $m_n$, with $\tau^2$ replaced by $\tau^2 + \tau_0^2$ and $m$ by $\lambda_0$.
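The computations of Case (1) can be reproduced with a short sketch. It relies on the standard fact that, in this Gaussian model, the marginal likelihood depends on $m$ only through the $N(m, \tau^2 + \sigma^2/n)$ density of $\bar X_n$; the numerical values of `xbar`, `n`, `sigma2`, `tau2`, `lam0` and `tau02` are arbitrary illustrations, and the grid search is only a check of $\hat\lambda_n = \bar X_n$.

```python
import math


def posterior_normal(xbar, n, sigma2, m, tau2):
    """Mean and variance of the N(m, tau2)-prior posterior given xbar."""
    w = tau2 / (tau2 + sigma2 / n)  # weight on the sample mean
    mean = (1.0 - w) * m + w * xbar
    var = 1.0 / (1.0 / tau2 + n / sigma2)
    return mean, var


def log_marginal_xbar(xbar, n, sigma2, m, tau2):
    """Log density of xbar under its marginal N(m, tau2 + sigma2 / n)."""
    v = tau2 + sigma2 / n
    return -0.5 * math.log(2.0 * math.pi * v) - (xbar - m) ** 2 / (2.0 * v)


xbar, n, sigma2, tau2 = 1.237, 50, 1.0, 2.0  # illustrative values

# Marginal MLE of m by grid search (grid step 0.01).
grid = [i / 100.0 for i in range(-500, 501)]
m_hat = max(grid, key=lambda m: log_marginal_xbar(xbar, n, sigma2, m, tau2))

# EB posterior: plug in m = xbar; hierarchical posterior: inflate tau2.
eb_mean, eb_var = posterior_normal(xbar, n, sigma2, m=xbar, tau2=tau2)
lam0, tau02 = 0.0, 1.0                        # illustrative hyperprior
hier_mean, hier_var = posterior_normal(xbar, n, sigma2, m=lam0, tau2=tau2 + tau02)
print(m_hat, (eb_mean, eb_var), (hier_mean, hier_var))
```

The grid maximizer agrees with $\bar X_n$ up to the grid resolution, the EB posterior is centered exactly at $\bar X_n$, and the hierarchical posterior has a slightly larger variance, reflecting the extra prior uncertainty.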



Case (2). Recall from Lehmann and Casella (1998), page 263, that, when $\lambda = \tau^2$, with $m$ fixed,

$$\sigma^2 + n\hat\tau_n^2 = \max\{\sigma^2,\, n(\bar X_n - m)^2\}, \quad\text{so that}\quad \hat\tau_n^2 = \frac{\sigma^2}{n}\max\Bigl\{\frac{n(\bar X_n - m)^2}{\sigma^2} - 1,\ 0\Bigr\}.$$

The EB posterior $\Pi(\cdot\mid\hat\tau_n^2, X_{1:n})$ is normal with mean and variance having the same expressions as in Case (1), with $\tau^2$ replaced by $\hat\tau_n^2$. A hierarchical Bayes approach would
assign a prior on $\tau^2$, e.g., $1/\tau^2 \sim \mathrm{Gamma}(\nu/2, 2/\nu)$. This leads to a Student's-t prior
distribution for $\theta$, with "flatter" tails, that may give better frequentist properties, see,
for example, Berger and Robert (1990), Berger and Strawderman (1996). However, the
Student's-t prior is no longer conjugate and the EB posterior is simpler to compute.

In this example, the EB posterior is only partially regular, in the sense that $\hat\tau_n^2$ can be
equal to zero with positive probability so that $\Pi(\cdot\mid\hat\tau_n^2, X_{1:n})$ is degenerate at $m$, although,
in the case where $m \neq \theta_0$, this probability converges to zero. This type of behaviour is
discussed more extensively in Subsection 3.3, where it is shown that, if $m \neq \theta_0$, the EB
posterior merges strongly with any regular posterior, including the hierarchical posterior,
with probability going to 1, whereas, if $m = \theta_0$, there is a positive probability that the EB
posterior does not merge strongly with any regular posterior.

Case (3). In this case, the marginal MLE for $\lambda = (m, \tau^2)$ is $\hat\lambda_n = (\bar X_n, 0)$. The EB posterior
is completely irregular in the sense that it is always degenerate at $\bar X_n$. Note that this
example is much more general than the Gaussian case and applies, in particular, to any
location-scale family of priors. Indeed, if the model $p_\theta^{(n)}$ admits an MLE $\hat\theta_n$ and $\pi(\cdot\mid\lambda)$ is of
the form $\sigma^{-1} g((\cdot - \mu)/\sigma)$, with $\lambda = (\mu, \sigma)$, for some unimodal density $g$ which is maximum
at 0, then $\hat\lambda_n = (\hat\theta_n, 0)$ and the EB posterior is the point mass at $\hat\theta_n$. This shows that such
families of priors should not be used in combination with marginal MLE EB procedures.
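The partial regularity in Case (2) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: it assumes the Gaussian model with known $\sigma^2$, draws $\bar X_n$ directly from its sampling distribution, and uses the closed form of the marginal MLE of $\tau^2$ recalled in Case (2). The function names `tau2_hat` and `prob_degenerate` are hypothetical.

```python
import math
import random


def tau2_hat(xbar, m, sigma2, n):
    """Marginal MLE of tau^2 when the prior mean m is fixed (Case (2))."""
    return (sigma2 / n) * max(n * (xbar - m) ** 2 / sigma2 - 1.0, 0.0)


def prob_degenerate(theta0, m, sigma2=1.0, n=100, reps=20000, seed=0):
    """Monte Carlo estimate of P(tau2_hat = 0) when the true mean is theta0."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2 / n)  # standard deviation of the sample mean
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(theta0, sd)  # sampling distribution of the sample mean
        if tau2_hat(xbar, m, sigma2, n) == 0.0:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("m = theta0 :", prob_degenerate(theta0=0.0, m=0.0))
    print("m != theta0:", prob_degenerate(theta0=1.0, m=0.0))
```

When $m = \theta_0$, the event $\{\hat\tau_n^2 = 0\}$ coincides with $\{|Z| \le 1\}$ for a standard normal $Z$, so the estimate should stay near 0.68 whatever $n$ is; when $m \neq \theta_0$ it should drop towards 0 as $n$ grows.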

Next, two nonparametric examples concerning Dirichlet process (DP) location-scale and location mixtures of Gaussians are exhibited: in the first one, the marginal MLE for the precision
parameter of the DP base measure is considered; in the second one, a moment-type estimator
for the mean of a normal base measure is employed.



Example 2: Nonparametric case 

Gaussian DP mixture - I. Consider the following nonparametric model of Gaussian location-scale mixtures: the observations $X_i\mid G \overset{iid}{\sim} p_G(\cdot) := \int \phi(\cdot\mid\mu,\sigma^2)\,dG(\mu,\sigma)$, where $\phi(\cdot\mid\mu,\sigma^2)$
denotes the density of a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. In this
example, $\theta = G$ belongs to the set $\mathcal{G}$ of all probability measures on $\mathbb{R}\times\mathbb{R}_+^*$. We assume
that $G \sim DP(\alpha(\cdot\mid\lambda))$, where $\alpha(\cdot\mid\lambda)$ denotes a family of positive and finite measures on
$\mathbb{R}\times\mathbb{R}_+^*$. Liu (1996) and McAuliffe, Blei and Jordan (2006) consider EB procedures in this
type of models. In particular, Liu (1996) considers the marginal MLE for $\lambda = \alpha(\mathbb{R}\times\mathbb{R}_+^*)$,
fixing the base probability measure $\bar\alpha(\cdot) := \alpha(\cdot)/\alpha(\mathbb{R}\times\mathbb{R}_+^*)$. Even if he considers a mixture
of Binomial distributions, the argument remains valid for other types of Dirichlet process
mixtures. For the sake of simplicity, we present our computations in the case of Gaussian
mixtures. Following Liu (1996), see also Petrone and Raftery (1997), $\hat\lambda_n$ is the solution of

$$\sum_{j=1}^{n} \frac{\lambda}{\lambda + j - 1} = E[K_n \mid \lambda, X_{1:n}], \tag{2.4}$$

where $E[K_n\mid\lambda, X_{1:n}]$ is the expected number of occupied clusters under the conditional
posterior distribution, given $\lambda$. If we assume that $\bar\alpha$ has support $A\times[\underline\sigma,\bar\sigma]$, with $A$ a
compact interval of $\mathbb{R}$, $0<\underline\sigma<\bar\sigma<\infty$ and $\Theta = \{G : \mathrm{supp}(G)\subseteq A\times[\underline\sigma,\bar\sigma]\}$, then, from
Theorem 3.2 of Ghosal and van der Vaart (2001), page 1244, $\{p_G : G\in\Theta\}$ has bracketing
Hellinger metric entropy satisfying condition (2.1), so that assumption (A1) is fulfilled.
Moreover, if $p_{G_0}$ is a mixture of Gaussian distributions, $p_{G_0}(\cdot) := \int\phi(\cdot\mid\mu,\sigma^2)\,dG_0(\mu,\sigma)$,
with $\mathrm{supp}(G_0)\subseteq A\times[\underline\sigma,\bar\sigma]$, then also condition (A2) is satisfied. The existence of a
solution of (2.4) implies that the EB posterior is well defined, thus, using Proposition 2.2,
we get consistency for the EB posterior.
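To give a feel for equation (2.4), the sketch below evaluates its left-hand side, which is the prior expected number of clusters under a DP with precision $\lambda$, and solves the equation by bisection for a fixed right-hand side. This is only an illustration under a strong simplifying assumption: in (2.4) the right-hand side $E[K_n\mid\lambda, X_{1:n}]$ itself depends on $\lambda$ and on the data and would have to be approximated (for instance by posterior simulation), whereas here it is replaced by a hypothetical constant `target`. The names `expected_clusters` and `solve_precision` are not from the paper.

```python
import math


def expected_clusters(lam, n):
    """Left-hand side of (2.4): sum_{j=1}^n lam / (lam + j - 1)."""
    return sum(lam / (lam + j - 1) for j in range(1, n + 1))


def solve_precision(target, n, lo=1e-8, hi=1e8, iters=200):
    """Solve expected_clusters(lam, n) = target by bisection on the log scale.

    Assumes 1 < target < n; the left-hand side is increasing in lam.
    `target` is a fixed stand-in for E[K_n | lam, X_{1:n}] (an assumption).
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if expected_clusters(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


if __name__ == "__main__":
    n = 50
    k = expected_clusters(2.0, n)
    print(k, solve_precision(k, n))  # the second value recovers lam = 2
```

In a full implementation one would iterate between approximating the posterior expectation at the current $\lambda$ and re-solving the equation; that outer loop is deliberately left out here.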



Gaussian DP mixture - II. Consider the following model of Gaussian location mixtures:
$X_1,\dots,X_n\mid(F,\sigma) \overset{iid}{\sim} p_{F,\sigma}(\cdot) := \int\phi(\cdot\mid\mu,\sigma^2)\,dF(\mu)$. In this case, $\theta = (F,\sigma)$ is assumed to
take values in $\Theta := \{(F,\sigma) : F(\mathbb{R}) = 1,\ \sigma\in[\underline\sigma,\bar\sigma]\}$, with $0<\underline\sigma<\bar\sigma<\infty$. Letting $\alpha(\cdot) := \alpha(\mathbb{R})\bar\alpha(\cdot)$, with fixed precision $0<\alpha(\mathbb{R})<\infty$ and probability measure $\bar\alpha$ specified, up to the
mean, as $N(\mu_\alpha,\sigma_\alpha^2)$, we assume that $F\sim DP(\alpha)$ and $\sigma\sim H$, with $\mathrm{supp}(H) = [\underline\sigma,\bar\sigma]$. In this
case, $\lambda = \mu_\alpha$, for which the estimator $\hat\lambda_n := \bar X_n$ is considered. Let $\bar\alpha_n(\cdot) := N(\bar X_n,\sigma_\alpha^2)$
and the un-normalized EB base measure $\hat\alpha_n(\cdot) := \alpha(\mathbb{R})\bar\alpha_n(\cdot) = \alpha(\mathbb{R})N(\bar X_n,\sigma_\alpha^2)$, so that the
EB prior on $(F,\sigma)$ is $DP(\hat\alpha_n)\times H$. We prove consistency of the EB posterior w.r.t. the
Hellinger distance $h$ or the $L_1$-distance. Let $m_0 := E_0[X_1]$ be the mean of $X_1$ under $P_{F_0,\sigma_0}$,
with $\sigma_0\in[\underline\sigma,\bar\sigma]$ and $F_0$ satisfying $F_0([-a,a]^c) < e^{-c_0a^2}$ for all large $a$ and a constant
$c_0>0$. For fixed $\epsilon>0$, choose $0<\delta<\epsilon^2$ small enough and $a_n = n^q$, with $1/2<q<1$. Consider the sieve set $\Theta_n := \{(F,\sigma) : F([-a_n,a_n]) > 1-\delta,\ \sigma\in[\underline\sigma,\bar\sigma]\}$. From



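Theorem 6 in Ghosal and van der Vaart (2007b), combined with the proof of Theorem 7 in
Ghosal and van der Vaart (2007b), if $\tilde\Theta_n := \{(F,\sigma) : F([-a_n,a_n]) = 1,\ \sigma\in[\underline\sigma,\bar\sigma]\}$, for
all $\epsilon^2/2^8 < u < \sqrt2\,\epsilon$, $H_{[\,]}(u/c_3,\tilde\Theta_n,h) \lesssim a_n(\log a_n + \log(1/\epsilon))^2$. Thus, for $n$ large enough,
$\int_{\epsilon^2/2^8}^{\sqrt2\epsilon}\bigl(H_{[\,]}(u/c_3,\tilde\Theta_n,h)\bigr)^{1/2}\,du \lesssim \epsilon\sqrt{a_n}\,(\log a_n) < \epsilon^2\sqrt n$ because $a_n = n^q$ with $q<1$. By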

n6n ^(pSj^) dn(F, cr|A n ) < e" Cine2 for all large n. We now 



-almost surely, J H , 



show that the expected value of the integral over $\Theta_n^c$ tends to 0. In this case, $\lambda_0 = m_0$.
Define $\alpha_0(\cdot) := \alpha(\mathbb{R})\bar\alpha_0(\cdot) = \alpha(\mathbb{R})N(m_0,\sigma_\alpha^2)$. Since $\bar X_n\to m_0$ a.s. $[P_0^\infty]$, we have $\hat\alpha_n\Rightarrow\alpha_0$
a.s. $[P_0^\infty]$, whence, by Theorem 3.2.6 in Ghosh and Ramamoorthi (2003), pages 105-106,
$DP(\hat\alpha_n)\Rightarrow DP(\alpha_0)$, a.s. $[P_0^\infty]$, and condition (i) of Proposition 2.3 is fulfilled. Denote by
$\Pi(\cdot\mid\lambda_0)$ the overall limiting prior $DP(\alpha_0)\times H$. Since $F\sim DP(\hat\alpha_n)$, using the stick-breaking
representation, we have $p_{F,\sigma}(\cdot) = \sum_{j=1}^\infty p_j\,\phi_\sigma(\cdot-\xi_j)$, with $\xi_j\overset{iid}{\sim}N(\bar X_n,\sigma_\alpha^2)$. As $\phi_\sigma(\cdot-\xi_j) = \phi_\sigma(\cdot-(\bar X_n-m_0)-\xi_j')$, with $\xi_j'\sim N(m_0,\sigma_\alpha^2)$, we have $p_{F,\sigma}(\cdot) = p_{F',\sigma}(\cdot-(\bar X_n-m_0))$, with
$F'\sim DP(\alpha_0)$. Let $A_n$ be the set wherein the inequality $|\bar X_n-m_0| \le L/\sqrt n$ holds true for
some constant $L>0$. Note that $P_0^{(n)}(A_n^c)$ can be made as small as needed by choosing $L$
large enough. Using the above representation of the EB Dirichlet prior,

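$$\prod_{i=1}^n p_{F',\sigma}\bigl(X_i-(\bar X_n-m_0)\bigr) \le \prod_{i=1}^n\int_{\mathbb{R}}\phi_\sigma(X_i-\xi)\,e^{L|X_i-\xi|/(\sqrt n\sigma^2)}\,dF'(\xi) = c_{n,\sigma}^n\prod_{i=1}^n\int_{\mathbb{R}}g_\sigma(X_i-\xi)\,dF'(\xi),$$

where $g_\sigma$ is the probability density proportional to $\phi_\sigma(y)\,e^{L|y|/(\sqrt n\sigma^2)}$ and

$$c_{n,\sigma} := \int_{\mathbb{R}}\phi_\sigma(y)\,e^{L|y|/(\sqrt n\sigma^2)}\,dy \le e^{L^2/(2n\sigma^2)}\Bigl(1+\frac{2L}{\sigma\sqrt n}\Bigr),$$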



which implies that 
Eo 



l An {Xi:n)l R(p { F n ^)dU(F,a\X n ) < (l + ^yn(G^|A )<e- c ^ +c ^< e - c 34 5 
for n large enough, by definition of a n . 

Next, we bound from below the denominator of the ratio defining the EB posterior probability of the set $U_\epsilon^c$. Using similar computations to those above, on $A_n$,

[ R(pP a )dn(F,a\\ n )>e~iclf a ) dU(F' , a\X ), 

Je Je 

L \-(\ 

with ()F',a{') = c n a Jr 0<t(" ~~ Cj e v ^ ct2 d-F'(£), where, with abuse of notation, we still 
denote by $c_{n,\sigma}$ the normalizing constant, although it is not exactly the same as before.
Using similar computations to those in the proof of (5.21) in Ghosal and van der Vaart
(2001), we obtain that, for any $\eta>0$, $\int_\Theta R(g^{(n)}_{F',\sigma})\,d\Pi(F',\sigma\mid\lambda_0) > \exp\{-n\eta\}$, for $n$ large
enough. Consistency of the EB posterior follows.

We have seen in Example 1 that, even in simple models, the EB posterior under the marginal
MLE may be degenerate, which, from a Bayesian perspective, is a pathological behaviour. In
the following section, we explain why and when such behaviour is to be expected. The key factor
is the choice of the family of priors $\{\pi(\cdot\mid\lambda) : \lambda\in\Lambda\}$. This has strong practical implications since
it shows that, unless some kind of strong shrinkage is required, it is better to avoid maximizing over
scale hyperparameters.

3 Frequentist strong merging and asymptotic behavior of $\hat\lambda_n$

For each $\lambda\in\Lambda\subseteq\mathbb{R}^\ell$, $\ell\in\mathbb{N}$, let $\Pi(\cdot\mid\lambda)$ be a prior on $\Theta\subseteq\mathbb{R}^k$, $k\in\mathbb{N}$, with density $\pi(\cdot\mid\lambda)$ w.r.t.
a common measure $\nu$. Before formally stating a general result which describes the asymptotic
behaviour of EB posteriors, we present an informal argument to explain the heuristics behind it.
Under usual regularity conditions on the model, the marginal distribution, given $\lambda$, can be
approximated as

$$m(X_{1:n}\mid\lambda) = \pi(\theta_0\mid\lambda)\,p^{(n)}_{\hat\theta_n}(X_{1:n})\,\frac{(2\pi)^{k/2}}{n^{k/2}|I(\theta_0)|^{1/2}}\,(1+o_p(1))$$

under $P^{(n)}_{\theta_0}$. If we could interchange the maximization and the limit, we would have

$$\operatorname*{argmax}_{\lambda\in\Lambda} m(X_{1:n}\mid\lambda) = \operatorname*{argmax}_{\lambda\in\Lambda}\pi(\theta_0\mid\lambda) + o_p(1).$$

An interesting phenomenon occurs: assuming the above argument is correct, the marginal MLE
is asymptotically maximizing the family $\{\pi(\theta_0\mid\lambda) : \lambda\in\Lambda\}$, where $\theta_0$ is the true value of the
parameter. In other words, it selects the most interesting value of $\lambda$ in terms of the prior model.
We call the set of values of $\lambda$ maximizing $\pi(\theta_0\mid\lambda)$ the prior oracle set of hyperparameters and
denote it by $\Lambda_0$. In terms of (strong) merging, however, $\Lambda_0$ may correspond to unpleasant values,
typically if

$$\sup_{\lambda\in\Lambda}\pi(\theta_0\mid\lambda) = \infty$$

and $\Pi(\cdot\mid\lambda_0)$ is the Dirac mass at $\theta_0$. Then, the EB posterior is degenerate. This is what happens
in Cases (2) and (3) of Example 1 or, more generally, when $\pi(\cdot\mid\lambda)$ is a location-scale family and
$\lambda$ contains the scale parameter. Obviously, in such cases, we cannot interchange the limit and
the maximization. We now present these ideas more rigorously.
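The heuristic can be illustrated in the simplest non-degenerate situation, Case (1) of Example 1. The sketch below is an assumption-laden toy, not the paper's argument: it takes $X_i\mid\theta\sim N(\theta,\sigma^2)$ with $\theta\sim N(\lambda,\tau^2)$ and $\tau^2$ fixed, uses the fact that the marginal likelihood depends on $\lambda$ only through the $N(\lambda,\tau^2+\sigma^2/n)$ density of $\bar X_n$, and compares the grid maximizer of that density with the grid maximizer of $\lambda\mapsto\pi(\theta_0\mid\lambda)$. All function names and the grid are hypothetical.

```python
import math
import random


def log_marginal(lam, xbar, n, sigma2=1.0, tau2=1.0):
    """Lambda-dependent part of the log marginal likelihood: log N(xbar | lam, tau2 + sigma2/n)."""
    v = tau2 + sigma2 / n
    return -0.5 * math.log(2 * math.pi * v) - (xbar - lam) ** 2 / (2 * v)


def log_prior_at(theta0, lam, tau2=1.0):
    """Log prior density pi(theta0 | lam) for the N(lam, tau2) prior."""
    return -0.5 * math.log(2 * math.pi * tau2) - (theta0 - lam) ** 2 / (2 * tau2)


def compare(theta0=0.7, n=10000, seed=1):
    """Return (grid marginal MLE, grid prior-oracle value) for one simulated sample mean."""
    rng = random.Random(seed)
    xbar = rng.gauss(theta0, math.sqrt(1.0 / n))  # sample mean under theta0, sigma2 = 1
    grid = [i / 100 for i in range(-300, 301)]
    lam_mml = max(grid, key=lambda lam: log_marginal(lam, xbar, n))
    lam_oracle = max(grid, key=lambda lam: log_prior_at(theta0, lam))
    return lam_mml, lam_oracle


if __name__ == "__main__":
    print(compare())
```

Here the oracle value is $\lambda_0=\theta_0$ and the marginal MLE is $\bar X_n$, so the two maximizers agree up to sampling noise and grid resolution; with a scale hyperparameter in $\lambda$ the prior density at $\theta_0$ would be unbounded and no such comparison on a bounded grid would be meaningful.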

Let $d$ be a semi-metric on $\Theta$ and, for any $\epsilon>0$, let $U_\epsilon$ denote the open ball centered at $\theta_0$
with radius $\epsilon$. The map $g : \theta\mapsto\sup_{\lambda\in\Lambda}\pi(\theta\mid\lambda)$ from $\Theta$ to $\mathbb{R}^+$ induces a partition $\{\Theta_0,\Theta_0^c\}$ of
$\Theta$, with $\Theta_0 := \{\theta\in\Theta : g(\theta)<\infty\}$ and $\Theta_0^c := \{\theta\in\Theta : g(\theta)=\infty\}$. We refer to the case where
$g(\theta_0)<\infty$ or, equivalently, $\theta_0\in\Theta_0$, as the non-degenerate case and to the complementary case
as the degenerate case. As illustrated in the above heuristic discussion and in Sections 3.1 and
3.2 below, the use of this terminology is motivated by the fact that, in the former case, the EB
posterior is regular, whereas, in the latter case, it tends to be "too much concentrated" at $\theta_0$ to
merge strongly with any regular Bayes posterior.

3.1 Non-degenerate case 

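We give sufficient conditions for the EB posterior $\Pi(\cdot\mid\hat\lambda_n, X_{1:n})$, where $\hat\lambda_n$ is the marginal MLE,
to merge strongly with any posterior $\Pi(\cdot\mid\lambda, X_{1:n})$, $\lambda\in\Lambda$, in the non-degenerate case. The
possibility of having a degeneracy as in Case (2) of Example 1 is ruled out by assuming $\theta_0\in\Theta_0$.
When $\theta_0\in\Theta_0$, we define the set $\Lambda_0 := \{\lambda_0\in\Lambda : \pi(\theta_0\mid\lambda_0) = g(\theta_0)\}$ and the subset $\tilde\Lambda_0\subseteq\Lambda_0$
of all those $\lambda_0\in\Lambda_0$ for which the map $\theta\mapsto\pi(\theta\mid\lambda_0)$ is continuous at $\theta_0$ and, for any $\epsilon,\eta>0$,

$$\Pi\bigl(U_\epsilon\cap B_{KL}(\theta_0;\eta)\mid\lambda_0\bigr) > 0.$$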

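Theorem 3.1. Suppose that $\theta_0\in\Theta_0$. Assume that (A1) is satisfied and
(i) the map $g : \theta\mapsto\sup_{\lambda\in\Lambda}\pi(\theta\mid\lambda)$ is positive and continuous at $\theta_0$,
(ii) $\tilde\Lambda_0\neq\emptyset$,
then, for each $\lambda_0\in\tilde\Lambda_0$,

$$\frac{\hat m(X_{1:n})}{m(X_{1:n}\mid\lambda_0)}\to1\quad\text{a.s. }[P_0^\infty]. \tag{3.1}$$

If, in addition to (i) and (ii), the following assumption is satisfied

(iii) $\tilde\Lambda_0 = \Lambda_0$ is included in the interior of $\Lambda$ and, for any $\delta>0$, there exist $\epsilon,\eta>0$ so that

$$\sup_{\theta\in U_\epsilon}\ \sup_{\lambda\in\Lambda:\ d(\lambda,\Lambda_0)>\delta}\frac{\pi(\theta\mid\lambda)}{g(\theta)}\le1-\eta,$$

where $d(\lambda,\Lambda_0) := \inf_{\lambda_0\in\Lambda_0}d(\lambda,\lambda_0)$,

then

$$d(\hat\lambda_n,\Lambda_0)\to0\quad\text{and}\quad\|\pi(\cdot\mid\hat\lambda_n,X_{1:n})-\pi(\cdot\mid\lambda_0,X_{1:n})\|_1\to0\quad\text{a.s. }[P_0^\infty]. \tag{3.2}$$

The proof of Theorem 3.1 is presented in the Appendix. Some remarks and comments are
in order here.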

Remark 3.1. A key set of values of $\lambda$ for understanding the asymptotic behaviour of the EB
posterior is therefore the prior oracle set of hyperparameters, $\Lambda_0$, which consists of all those
values for which the prior mostly favors $\theta_0$. Equation (3.2) shows that, $P_0^\infty$-almost surely, $\hat\lambda_n$
"converges" to $\Lambda_0$. This result implies that the EB procedure asymptotically selects the smartest
values of $\lambda$ for "estimating" $\theta_0$. Furthermore, in this case the EB posterior does not degenerate.

Remark 3.2. Under the conditions of Theorem 3.1, if $\pi$ is any prior density w.r.t. $\nu$ which
is positive and continuous at $\theta_0$ and whose posterior is consistent at $\theta_0$, then the EB posterior
merges strongly with the posterior corresponding to the prior $\Pi$. This is a direct consequence
of Theorem 3.1 combined with Theorem 1.3.1 of Ghosh and Ramamoorthi (2003), pages 18-20.
In particular, the EB posterior merges strongly with any hierarchical Bayes posterior associated
with a prior $\Pi_h$, for $h$ a prior on $\lambda$, provided the map $\theta\mapsto\pi_h(\theta) := \int_\Lambda\pi(\theta\mid\lambda)\,dh(\lambda)$ is positive
and continuous at $\theta_0$ and the posterior corresponding to $\Pi_h$ is consistent at $\theta_0$ in terms of the
priors $\Pi(\cdot\mid\lambda)$, $\lambda\in\Lambda$.
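Strong merging can be made concrete in Case (1) of Example 1, where both posteriors are Gaussian with the same variance. The sketch below is an illustration under these assumptions only: it uses the closed form of the $L_1$-distance between $N(a,v)$ and $N(b,v)$, namely $2\{2\Phi(|a-b|/(2\sqrt v))-1\}$, with the EB posterior $N(\bar X_n,v_n)$ and the Bayes posterior $N(m_n,v_n)$ for a fixed prior mean $m$. The function names and the default parameter values are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def l1_distance(xbar, n, m=0.0, tau2=1.0, sigma2=1.0):
    """L1-distance between the EB posterior N(xbar, v_n) and the Bayes posterior N(m_n, v_n)."""
    v_n = 1.0 / (1.0 / tau2 + n / sigma2)   # common posterior variance
    w = (sigma2 / n) / (sigma2 / n + tau2)  # weight on the prior mean m
    m_n = w * m + (1.0 - w) * xbar          # Bayes posterior mean
    gap = abs(xbar - m_n)
    return 2.0 * (2.0 * norm_cdf(gap / (2.0 * math.sqrt(v_n))) - 1.0)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**6):
        print(n, l1_distance(xbar=2.0, n=n))
```

For a fixed value of the sample mean the distance decreases at rate $n^{-1/2}$, which is the kind of behaviour that (3.2) describes in the non-degenerate case.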



3.2 Degenerate case and extension to the model choice framework 

Theorem 3.1 implies that, if $g(\theta_0)<\infty$, under smoothness assumptions on $\pi(\theta\mid\lambda)$ for $\theta$ in
a neighborhood of $\theta_0$, the EB posterior will eventually be close to any Bayes posterior based
on a smooth prior density. Example 1, Cases (2) and (3), suggests that this might not be
the case when $g(\theta_0)=\infty$ and there exists a $\lambda_0\in\Lambda$ for which $\Pi(\cdot\mid\lambda_0)=\delta_{\theta_0}$. In Section
5.3 of the Appendix, we show that, for such $\theta_0$, the $L_1$-distance between the EB posterior
and the Bayes posterior associated with any $\lambda\in\Lambda$ is bounded from below on a set whose
probability remains asymptotically strictly positive, so that no strong merging can take place.
This critical phenomenon cannot be improved by having greater smoothness in the likelihood.
Indeed, consider usual regularity assumptions on the model, i.e., (A1) is satisfied, for any $\epsilon>0$
and $\theta\in U_\epsilon$,

$$l_n(\theta)-l_n(\hat\theta_n)\in-\frac n2(\theta-\hat\theta_n)'I(\theta_0)(\theta-\hat\theta_n)\,(1\pm\epsilon),\qquad\hat\theta_n\text{ denoting the MLE},$$

and $l_n(\hat\theta_n)-l_n(\theta_0)$ converges in distribution to a $\chi^2$-distribution with $k$ degrees of freedom. Also,
assume that $p^{(n)}_\theta$ is bounded as a function of $\theta$ for all $n$. Then, if $\theta_0\in\Theta_0^c$ and there exists $\lambda_0\in\Lambda$
such that $\Pi(\cdot\mid\lambda_0)=\delta_{\theta_0}$, the EB posterior cannot merge strongly with any posterior $\Pi(\cdot\mid\lambda,X_{1:n})$,
with $\lambda\in\Lambda$ such that the prior density $\pi(\cdot\mid\lambda)$ is positive and continuous at $\theta_0$. In particular, it
cannot merge with any hierarchical posterior.



Interestingly, Scott and Berger (2010) also encounter this phenomenon in their comparison
between fully Bayes and EB approaches for variable selection in regression models. We believe
this is due to the same reasons as described above, although it does not completely fit the setup
we have described because we have restricted ourselves to priors that are absolutely continuous
w.r.t. Lebesgue measure. However, this is not a crucial difference. We describe in an informal
way the link between our explanation above and their findings. First, we briefly recall their
setup. They consider a regression model

$$Y_i=\alpha+X_i'\beta+\epsilon_i,\qquad\epsilon_i\sim N(0,\phi^{-1}),\quad X_i\in\mathbb{R}^m,\ m\ge1,$$

and their aim is to select the best set of covariates among the $m$ candidates. They consider the
following hierarchy: given a model indexed by the inclusion vector $\gamma\in\{0,1\}^m$, where $\gamma_i=1$
means that the $i$th covariate belongs to the model $M_\gamma$, consider a prior $\pi_\gamma(\theta)=\pi(\alpha,\phi)\pi_\gamma(\beta)$,
where $\pi_\gamma(\beta)$ is absolutely continuous w.r.t. Lebesgue measure on $\mathbb{R}^{k_\gamma}$ as a distribution on
the coefficients included in the model $\gamma$. Then, they consider $\Pi(M_\gamma\mid p)=p^{k_\gamma}(1-p)^{m-k_\gamma}$ for
$p\in(0,1)$, and study the EB approach which consists in computing the marginal MLE for $p$.
The marginal likelihood is written as

$$m(Y\mid p)=\sum_\gamma p^{k_\gamma}(1-p)^{m-k_\gamma}\,m_\gamma(Y).$$

Each model is regular so that, under $P^{(n)}_{\theta_0}$, with $\theta_0=(\alpha_0,\beta_0,\phi_0)$,

—m ' c ^ = (27r) I^WI > 

for all $\gamma$ such that we cannot have $\gamma_j=0$ and $\beta_{0j}\neq0$; otherwise, the marginal is exponentially
small. Hence, if $\beta_0=0$, the above marginals are maximized (in $\gamma$) at $\gamma=0$ due to the Ockham's-razor effect of integration and, with probability going to 1, the EB posterior distribution puts
mass 1 on $M_0$. Generally speaking, if $\theta_0$ belongs to a model with $k_0$ covariates, the EB posterior
will concentrate on $p=k_0/m$ with probability going to 1. Again, this is an oracle value since,
in terms of prior on the models, it is centered at the right number of covariates; however, if $\theta_0$
is either in the null or in the largest model, it corresponds to a completely degenerate prior on
the models.
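The degeneracy at the null and at the largest model can be seen directly from the form of the prior on models. The sketch below makes the simplifying assumption that the marginal likelihood is dominated by the single term of the true model with $k_0$ covariates, so that the marginal MLE of $p$ maximizes $p^{k_0}(1-p)^{m-k_0}$; it then locates the maximizer on a grid. The function names and the grid are hypothetical and the dominance assumption is a heuristic, not a result proved in the text.

```python
def model_prior(p, k, m):
    """Prior probability p^k (1 - p)^(m - k) of a given model with k of the m covariates."""
    return p ** k * (1.0 - p) ** (m - k)


def argmax_p(k, m, grid_size=1000):
    """Grid maximizer of p -> model_prior(p, k, m) over [0, 1]."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(grid, key=lambda p: model_prior(p, k, m))


if __name__ == "__main__":
    m = 10
    for k0 in (0, 3, 10):
        p_hat = argmax_p(k0, m)
        print(k0, p_hat, model_prior(p_hat, k0, m))
```

For $0<k_0<m$ the maximizer is $k_0/m$ and the induced prior on models is non-degenerate; for $k_0=0$ or $k_0=m$ the maximizer sits on the boundary and the prior puts mass 1 on the null or on the full model, respectively.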

This is not specific to the linear regression example: in a general model choice
framework with competing models $M_j$, $j=1,\dots,J$, when $\lambda$ corresponds to a hyperparameter
on the distribution of models, it is not only the prior values $\pi(\theta_0\mid\lambda)$ which drive the behaviour
of the EB, but rather the values



E 

j:e eMj 



fpdj/2 
due to integration over the parameters in $M_j$. Thus, $\hat\lambda_n$ asymptotically maximizes $P(M^*\mid\lambda)$,
where $M^*$ is the smallest model containing $\theta_0$. Depending on the form of $P(M^*\mid\lambda)$, the EB prior
distribution can be degenerate or not.

In the following section, we describe more carefully regression with $g$-priors, which has been
considered in the literature in combination with EB procedures.



3.3 Example: Regression with g-priors 



Consider the canonical Gaussian regression model $Y=\mathbf{1}\alpha+X\beta+\epsilon$, with $\epsilon\sim N_n(0,\sigma^2I)$,
where $Y=(Y_1,\dots,Y_n)'$ is the response vector, $\mathbf{1}$ denotes the vector of 1's of length $n$, $\alpha$ is the
intercept, $X$ is the $n\times k$ fixed design matrix, $\beta=(\beta_1,\dots,\beta_k)'$ is the $k$-dimensional vector of
regression coefficients, $\sigma^2$ is the error variance and $I$ denotes the $n\times n$ identity matrix. Clearly,
$Y\mid\alpha,\beta,\sigma^2\sim N_n(\mathbf{1}\alpha+X\beta,\sigma^2I)$. Note that $Y_i\mid\alpha,\beta,\sigma^2\overset{ind}{\sim}N(\alpha+x_i'\beta,\sigma^2)$, $i=1,\dots,n$, where
$x_i'$ is the $i$th row of $X$: the $Y_i$'s are (conditionally) independent, but not identically distributed.
Let $X$ denote the design matrix whose columns have been re-centered so that $\mathbf{1}'X=0'$, in which
case $\beta$ can be estimated separately from $\alpha$ using OLS estimators. The following condition is
assumed throughout: for fixed $1\le k<n$, the matrix $n^{-1}(X'X)$ is positive definite and converges
to a positive definite matrix $V$. We consider the following default prior specification for $\alpha,\beta,\sigma^2$:

$$\pi(\alpha,\sigma^2)\propto\frac1{\sigma^2},\qquad\beta\mid\sigma^2\sim N_k\bigl(0,\ g\sigma^2(X'X)^{-1}\bigr),\quad\text{with }g>0, \tag{3.3}$$

where the prior covariance matrix of $\beta$ is a scalar multiple $g$ of the covariance matrix $\sigma^2(X'X)^{-1}$
of the OLS estimator $\hat\beta$ of $\beta$. The prior mean for $\beta$ can, in principle, be any $\beta_a\in\mathbb{R}^k$; nonetheless,
we take $\beta_a=0$ because this choice helps keep the presentation at a simple technical level,
without any loss of generality for the purpose of this study. The prior in (3.3), which is a
modified version of the original Zellner (1986)'s $g$-prior, is widely used in the variable selection
literature, see, e.g., Clyde and George (2000), George and Foster (2000), Liang et al. (2008). In
this example, $\lambda=g$, therefore we consider the EB posterior of $\beta$ with the marginal MLE of $g$,
which, from equation (9) in Liang et al. (2008), page 413, is known to be

$$\hat g_n:=\max\{F_n-1,\,0\},\qquad\text{where}\quad F_n:=\frac{R^2/k}{(1-R^2)/(n-1-k)},$$

$R^2$ being the ordinary coefficient of determination. Suppose that $Y$ is generated by the model
with parameter values $\alpha_0,\beta_0,\sigma_0^2$. Let $(\Omega,\mathcal{F},P)$ denote the probability space wherein $Y$ is
defined. It turns out that

$$\begin{cases}\lim_{n\to\infty}P(\hat g_n=0)=\lim_{n\to\infty}P(F_n\le1)\ge\gamma>0, & \text{if }\beta_0=0,\\[2pt] \lim_{n\to\infty}P(\hat g_n>0)=\lim_{n\to\infty}P(F_n>1)=1, & \text{if }\beta_0\neq0.\end{cases} \tag{3.4}$$



Interestingly, when $\beta_0=0$, even if the prior guess on the value of $\beta_0$ is correct, the probability
that the marginal MLE takes a value in the boundary (thus causing the EB posterior to be
degenerate) does not asymptotically vanish. Conversely, when $\beta_0\neq0$, even if the prior guess on
$\beta_0$ is wrong, the probability that the EB posterior is non-degenerate tends to 1, as $n\to\infty$. To
prove (3.4), let

$$\tilde F_n:=\frac{(\hat\beta-\beta_0)'(X'X)(\hat\beta-\beta_0)/k}{SSE/(n-1-k)}.$$

• If $\beta_0=0$, then $F_n=\tilde F_n\overset{d}{\to}\chi^2_k/k$, because $SSE/(n-k-1)\overset{p}{\to}\sigma_0^2$, and $\lim_{n\to\infty}P(\hat g_n=0)\ge P(\chi^2_k/k\le1)=:\gamma>0$;

• if $\beta_0\neq0$, from consistency of $\hat\beta$, i.e., $\hat\beta\overset{p}{\to}\beta_0$,

$$R_n:=\frac{n^{-1}[(\beta_0-2\hat\beta)'(X'X)\beta_0]/k}{SSE/(n-1-k)}\overset{p}{\to}-\frac{(\beta_0'V\beta_0)/k}{\sigma_0^2}<0,$$

which implies that $1+nR_n\to-\infty$. Consequently,

$$P(\hat g_n>0)=P(F_n>1)=P\Bigl(\tilde F_n>1+\frac{[(\beta_0-2\hat\beta)'(X'X)\beta_0]/k}{SSE/(n-1-k)}\Bigr)=P(\tilde F_n>1+nR_n)\to1.$$

We now study the consequences of (3.4) on frequentist merging in total variation. Some preliminary remarks are in order. For each $g>0$, the posterior $\Pi(\cdot\mid g,Y)$ of $\beta$ is absolutely continuous
w.r.t. Lebesgue measure on $\mathbb{R}^k$. Let $\pi(\cdot\mid g,Y)$ denote its density. By direct computations, whatever $\beta_0\in\mathbb{R}^k$, for each $g>0$, $\Pi(\cdot\mid g,Y)\Rightarrow\delta_{\beta_0}$, a.s. $[P]$. Having defined the set $\Omega_n:=\{\hat g_n=0\}$, clearly,
$\Omega_n\subseteq\{\Pi(\cdot\mid\hat g_n,Y)=\delta_0\}$. We have the following results.

• If $\beta_0=0$ then, for each $g>0$, $\lim_{n\to\infty}P\bigl(d_{TV}(\Pi(\cdot\mid g,Y),\Pi(\cdot\mid\hat g_n,Y))=1\bigr)>0$, where
$d_{TV}$ denotes the total variation distance. Therefore, even if the prior guess on $\beta_0$ is
correct, there is a set of positive probability wherein strong merging cannot take place.
In fact, on $\Omega_n$, for the Borel set $A=\{0\}$, we have $\Pi(A\mid g,Y)=0$ because $A$ has null
Lebesgue measure. Then, $1\ge d_{TV}(\Pi(\cdot\mid g,Y),\Pi(\cdot\mid\hat g_n,Y))\ge|\Pi(A\mid g,Y)-\delta_0(A)|=1$.
Thus, $\lim_{n\to\infty}P(\{d_{TV}(\Pi(\cdot\mid g,Y),\Pi(\cdot\mid\hat g_n,Y))=1\})\ge\lim_{n\to\infty}P(\Omega_n)\ge\gamma>0$.

• If $\beta_0\neq0$ then, for each $g>0$, $P(\|\pi(\cdot\mid g,Y)-\pi(\cdot\mid\hat g_n,Y)\|_1\to0)\to1$, where $\pi(\cdot\mid\hat g_n,Y)$
denotes the Lebesgue density of the EB posterior. The result assures that, even if the prior
guess on $\beta_0$ is wrong, strong merging takes place on a set with probability tending to 1.
This is a direct consequence of Theorem 3.1.
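The dichotomy in (3.4) can be explored by simulation. The following is a minimal sketch restricted, for simplicity, to a single covariate ($k=1$), so that $R^2$ is the squared sample correlation and no matrix algebra is needed; the design, the intercept value 1.0 and the unit error variance are arbitrary assumptions, and the names `g_hat` and `prob_g_zero` are hypothetical.

```python
import random


def g_hat(y, x):
    """Marginal MLE of g, max{F_n - 1, 0}, for simple linear regression (k = 1)."""
    n, k = len(y), 1
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)  # coefficient of determination
    f_n = (r2 / k) / ((1.0 - r2) / (n - 1 - k))
    return max(f_n - 1.0, 0.0)


def prob_g_zero(beta0, n=100, reps=4000, seed=0):
    """Monte Carlo estimate of P(g_hat = 0) under slope beta0 and a fixed design."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # design kept fixed across replications
    zeros = 0
    for _ in range(reps):
        y = [1.0 + beta0 * a + rng.gauss(0.0, 1.0) for a in x]
        if g_hat(y, x) == 0.0:
            zeros += 1
    return zeros / reps


if __name__ == "__main__":
    print("beta0 = 0:", prob_g_zero(0.0))
    print("beta0 = 1:", prob_g_zero(1.0))
```

With $\beta_0=0$ the estimated probability should be close to $P(\chi^2_1\le1)\approx0.68$, so the EB posterior is degenerate on a non-negligible fraction of samples; with $\beta_0\neq0$ the probability should be essentially zero for moderate $n$.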



4 Final remarks 



In this paper, we discussed whether the common knowledge that an EB posterior is asymptotically equivalent to other Bayes procedures is correct. We first discussed weak merging as a
minimal requirement to motivate, from a Bayesian viewpoint, the use of EB procedures. Along
the lines of Diaconis and Freedman (1986)'s results on consistency of Bayesian procedures, we
showed that merging is intimately related to consistency of the EB posterior and some general
conditions on weak consistency are provided. Weak merging is, however, too weak a criterion
to shed light on some pathological examples related to EB posteriors. Hence, we also studied
strong merging in a more restricted framework, which has enabled us to characterize the families
of priors which could lead to those degenerate behaviours.

In Section 3 we showed that, at least for finite-dimensional parameter spaces, the EB procedure with the marginal MLE asymptotically selects the oracle value of the hyperparameter, that
is, the value for which the prior mostly favors $\theta_0$. An open issue is whether this value also leads
to optimal frequentist asymptotic properties of the EB posterior. In nonparametric problems,
for instance, frequentist asymptotic properties of Bayes procedures may crucially depend on the
fine details of the prior: in particular, the posterior may have sub-optimal or optimal contraction
rate depending on the value of a hyperparameter $\lambda$ and there exists a value $\lambda^*$ entailing the
minimax-optimal rate. An open question is whether the oracle value $\lambda_0$ may be optimal in this
sense.

A different problem related to the one herein investigated is that of maximum likelihood estimation and Bayesian inference for exchangeable data. The data would be physically exchangeable
with probability law $P_\lambda$, rather than physically independent (i.i.d. according to $P_{\theta_0}$). Then, the
EB approach would account for maximum likelihood estimation of $\lambda$, whereas Bayesian inference would put a prior on $\lambda$. In this case, there would be a true value $\lambda_0$ of $\lambda$ and frequentist
asymptotic properties should be studied w.r.t. $P_{\lambda_0}$, rather than w.r.t. $P_0^\infty=P_{\theta_0}^\infty$.

Acknowledgements. This work originated from a question by Persi Diaconis. We are grateful 
to him and Jim Berger for stimulating discussions. S.P. has been partially supported by the 
Italian Ministry of University and Research, grant 2008MK3AFZ. 



5 Appendix 

5.1 Proof of Proposition 2.3

Fix $\epsilon>0$. Set $N_n:=\int_{U_\epsilon^c}R(p^{(n)}_\theta)\,d\Pi(\theta\mid\hat\lambda_n)$; under (A1), $N_n<e^{-c_1n\epsilon^2}$ for all large $n$, $P_0^\infty$-almost
surely. Let $D_n:=\int_\Theta R(p^{(n)}_\theta)\,d\Pi(\theta\mid\hat\lambda_n)$. In order to bound $D_n$ from below, it is convenient to
refer to the probability space, say $(\Omega,\mathcal{F},P)$, wherein the $X_i$'s are defined. Let

/4")Q:=n(-|A n (u;)), n = l, 2, and /x (-) := fi(-|Ao). 

Let := {uj € : fin un\. By assumption (i), P(f?n) = 1- For any uj £ by Skorohod's 
theorem (cf. Theorem 1.8 in lEthier and Kurta (|1986l ). pages 102-103), there exists a probability 
space (17', J-', p) on which T-valued random elements Y^, n = 1, 2, . . ., and Yq are defined 
such that Yn^ ~ pri \ n = 1, 2, . . ., Yq ~ fj,Q and diY^ Yq{oj')) — > for p-almost every 
J G ST. Let Sli := {uj G £7 : (££2) holds true}. Clearly, P($7 n fii) = 1. Fix uj G (Qq n ^i). For 
any 77 > 0, let 

{(n) 
(r, C) G K v/2 : lkn i log (X l:n (uj)) = KL^tq, Co); (r, 0) for all r n -> r 

P (r n ,0 
By assumptions (ii)-(m), n(sJ^|A ) > 0. Defined the set D^l := {(a/, () : (Y (oj'), C) € sff 2 }, 



I (a/, C) dp(a/) dfi(C) = n(5^|A ) > 0. 



(5.1) 



By Fubini's theorem, a change of measure and Fatou's lemma, 



lim e nv D n > lim exp < 

n— >-oo JZ J CI' n^roo 



n 



(n) 



dp(a/)dn(C) 



> 



/z7n' D i/2 



x lim exp < 



n 



(n) 



dp(w')dn(C) = oo, 



because the integrand is equal to $\infty$ over a set of positive probability, see (5.1). Thus, $D_n>e^{-n\eta}$
for all large $n$, $P_0^\infty$-almost surely. Choosing $0<\eta<c_1\epsilon^2$, for $\delta:=(c_1\epsilon^2-\eta)>0$, we have
$\Pi(U_\epsilon^c\mid\hat\lambda_n,X_{1:n})=N_n/D_n<e^{-n\delta}$ for all large $n$, $P_0^\infty$-almost surely. The assertion follows.



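5.2 Proof of Theorem 3.1

We begin by proving (3.1). From (ii), for each $\lambda_0\in\tilde\Lambda_0$, $P_0^\infty$-almost surely, $m(X_{1:n}\mid\lambda_0)>0$ for
all large $n$. By definition of $\hat\lambda_n$, $0<m(X_{1:n}\mid\lambda_0)\le\hat m(X_{1:n})<\infty$ for all large $n$, whence

$$\frac{\hat m(X_{1:n})}{m(X_{1:n}\mid\lambda_0)}\ge1\quad\text{for all large }n, \tag{5.2}$$

$P_0^\infty$-almost surely. We prove the reverse inequality. Using (A1), (i) and (ii), for any $\delta>0$,
there exists $\epsilon>0$ (depending on $\delta$, $\theta_0$ and $g(\theta_0)$) so that, with probability greater than or equal
to $1-c_2(n\epsilon^2)^{-(1+t)}$,

$$\begin{aligned}\forall\lambda\in\Lambda,\quad\frac{m(X_{1:n}\mid\lambda)}{p^{(n)}_{\theta_0}(X_{1:n})}&<e^{-c_1n\epsilon^2}+\int_{U_\epsilon}R(p^{(n)}_\theta)\,\pi(\theta\mid\lambda)\,d\nu(\theta)\\&\le e^{-c_1n\epsilon^2}+\int_{U_\epsilon}R(p^{(n)}_\theta)\,g(\theta)\,d\nu(\theta)\\&\le e^{-c_1n\epsilon^2}+(1+\delta/3)\int_{U_\epsilon}R(p^{(n)}_\theta)\,g(\theta_0)\,d\nu(\theta)\\&\le e^{-c_1n\epsilon^2}+(1+2\delta/3)\int_{U_\epsilon}R(p^{(n)}_\theta)\,\pi(\theta\mid\lambda_0)\,d\nu(\theta),\end{aligned}$$

where the second inequality descends from the definition of $g$, because $\pi(\theta\mid\lambda)\le g(\theta)$ for all
$\theta\in U_\epsilon$, the third one from the positivity and continuity of $g$ at $\theta_0$ and the last one from the fact
that $g(\theta_0)=\pi(\theta_0\mid\lambda_0)$, together with the continuity of $\pi(\theta\mid\lambda_0)$ at $\theta_0$. By the first Borel-Cantelli
lemma, for any $\delta>0$, there exists $\epsilon>0$ so that

$$\forall\lambda\in\Lambda,\quad\frac{m(X_{1:n}\mid\lambda)}{p^{(n)}_{\theta_0}(X_{1:n})}<e^{-c_1n\epsilon^2}+(1+2\delta/3)\int_{U_\epsilon}R(p^{(n)}_\theta)\,\pi(\theta\mid\lambda_0)\,d\nu(\theta)\quad\text{for all large }n,$$

$P_0^\infty$-almost surely. The Kullback-Leibler condition on $\Pi(\cdot\mid\lambda_0)$ implies that, on a set of $P_0^\infty$-probability 1,

$$\forall\eta>0,\quad\int_{U_\epsilon}R(p^{(n)}_\theta)\,\pi(\theta\mid\lambda_0)\,d\nu(\theta)>e^{-n\eta}\quad\text{for all large }n. \tag{5.3}$$

Therefore, for any $\delta>0$, on a set of $P_0^\infty$-probability 1, for each $\lambda\in\Lambda$, $m(X_{1:n}\mid\lambda)\le(1+\delta)\,m(X_{1:n}\mid\lambda_0)$ for all large $n$, which, combined with (5.2), proves (3.1).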

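We now prove the convergence of $\hat\lambda_n$. Recall that, by assumption (A1), for any $\epsilon>0$, on a
set of $P_0^\infty$-probability 1,

$$\forall\lambda\in\Lambda,\quad\frac{m(X_{1:n}\mid\lambda)}{p^{(n)}_{\theta_0}(X_{1:n})}<e^{-c_1n\epsilon^2}+\int_{U_\epsilon}R(p^{(n)}_\theta)\,\pi(\theta\mid\lambda)\,d\nu(\theta)\quad\text{for all large }n.$$

For $\delta>0$, define $N_\delta:=\{\lambda\in\Lambda : d(\lambda,\Lambda_0)\le\delta\}$. For any fixed $\delta>0$, by assumption (iii), there
exist $\epsilon_1,\eta>0$ so that, on a set of $P_0^\infty$-probability 1,

$$\sup_{\lambda\in N_\delta^c}\frac{m(X_{1:n}\mid\lambda)}{p^{(n)}_{\theta_0}(X_{1:n})}<e^{-c_1n\epsilon_1^2}+(1-\eta)\int_{U_{\epsilon_1}}R(p^{(n)}_\theta)\,g(\theta)\,d\nu(\theta)\quad\text{for all large }n,$$

whence, using (i) and (ii) on the continuity of $g$ and $\pi(\cdot\mid\lambda_0)$, $\lambda_0\in\Lambda_0$, at $\theta_0$,

$$\sup_{\lambda\in N_\delta^c}\frac{m(X_{1:n}\mid\lambda)}{p^{(n)}_{\theta_0}(X_{1:n})}<e^{-c_1n\epsilon_1^2}+(1-\eta/2)\,\frac{m(X_{1:n}\mid\lambda_0)}{p^{(n)}_{\theta_0}(X_{1:n})}\quad\text{for all large }n.$$

Using (5.3), we finally get that $\sup_{\lambda\in N_\delta^c}m(X_{1:n}\mid\lambda)\le(1-\eta/4)\,m(X_{1:n}\mid\lambda_0)$ for all large $n$, $P_0^\infty$-almost surely. The fact that $\eta$ is fixed implies that, for $n$ large enough, $\hat\lambda_n\in N_\delta$, a.s. $[P_0^\infty]$. Since
$\Lambda_0$ is included in the interior of $\Lambda$, with $P_0^\infty$-probability 1, $\hat\lambda_n$ belongs to the interior of $\Lambda$ and
$\Pi(\cdot\mid\hat\lambda_n)\ll\nu$ for all large $n$. This fact, combined with consistency of both the EB posterior and
$\Pi(\cdot\mid\lambda_0,X_{1:n})$, and the convergence in (3.1), yields that, $P_0^\infty$-almost surely, for any $\epsilon>0$,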



|7r(-|A n , X Xxn ) - 7r(-|A , A"i :n )||i < e + 

< e + 



» 



[Xl-n] 



7t(0|A O ) 



m(Xi :n ) 



m(Xl;„|A 

(n) 



m(X 1:n ) m(Xi :n |A ) 
1 



di/(0) 



(Xl:n) 



< 2e + 



^ m(Xi :n ) 
Ut m(X 1:n ) 



\Tv(9\\ n )-Tv(9\\o)\du(e) 



ir(9\\ n )-ir(9\\ )\dv(9) 



for n large enough. We split U t into L> e := {0 G f7 e : tt(0|A„) > 7r(0|A o )} and D c t = {9 G U e : 
ir(9\\ n ) < 7r(#|Ao)}. Since, for any 5 > 0, if e is small enough, 7r(#|A n ) < tt(9\Xq)(1 + 5/3), 



'A 



(X 1:n )[7r(0|A n ) - tt(9\X )} du(9) < £ / p( n) (X 1:n )7r(0|A o ) di/(0) < ^m(X 1:n ). (5.4) 



D, 
From consistency of the EB posterior,



4 n) (X 1: „h(#|A o )cM0) < m(X l:n ) < / pf\x 1 .. n )Ti{e\X n )dy{e) + (e + 5/3)m(X 1:n ) 



whence 



p% l) (X hn )[7r(9\\ )-7r(9\\ n )}dij(9) < / p { ; i) (X 1:n )[Tr(9\\ n )-7r(9\\ )} du(9)+(e+S/3)m(X v . n ) 
and, using (|5.4|) . 

P(X 1:n )[ir(e\Xo) -n(9\X n )]du(9) < (e + 2S/3)rh(X 1:n ), 



Do 

which implies that 

r ^ ( y \ 

/ i,\r lM J \n(0\*n) ~ vr(6»|A )| du(9) < (e + 5) for all large n. 
J Ue m(X 1:n ) 

Thus, (3.2) is proved and the proof is complete.

5.3 Proof for the non-merging of the posterior in Section 3.2

We first recall the formal framework.

Theorem 5.1. Suppose that $\theta_0\in\Theta_0^c$. Assume that (A1) is satisfied and
(i) there exists $\lambda_0\in\Lambda$ such that $\Pi(\cdot\mid\lambda_0)=\delta_{\theta_0}$,
(ii) with $P^{(n)}_{\theta_0}$-probability going to 1, $\hat m(X_{1:n})\ge p^{(n)}_{\theta_0}(X_{1:n})$,

(iii) the model admits a LAN expansion in the following form: for each $\epsilon>0$, there exists a
set, with $P^{(n)}_{\theta_0}$-probability going to 1, wherein, uniformly in $\theta\in U_\epsilon$,

$$l_n(\theta)-l_n(\hat\theta_n)\in-\frac n2(\theta-\hat\theta_n)'I(\theta_0)(\theta-\hat\theta_n)\,(1\pm\epsilon),\qquad\hat\theta_n\text{ denoting the MLE},$$

(iv) $l_n(\hat\theta_n)-l_n(\theta_0)$ converges in distribution to a $\chi^2$-distribution with $k$ degrees of freedom.

Then, the EB posterior cannot merge strongly with any Bayes posterior $\Pi(\cdot\mid\lambda,X_{1:n})$, with $\lambda\in\Lambda$
such that the prior density $\pi(\cdot\mid\lambda)$ is positive and continuous at $\theta_0$.



Proof. Define, for any 5 > 0, the set Q U; s of xi- n 's such that e ln ^ 6n ' ln i e °) < 1 + 5. From assump- 
tion (iv), for every 5 > 0, lim^^ Pq™' (f2 n , a) > 0. From assumption (ii), rh(X\ :n ) /p^ (X\ :n ) > 



1. We now study the reverse inequality. Using (Al), for any e > 0, on a set A n with Pq^- 
probability going to 1, 

rh(Xi :n ) 



I e l ^)-Udo) du ^X n ) + 0(e~ 
Ju r 
Moreover, using the LAN condition (iii), for every $\theta \in U_\epsilon$,
\[
l_n(\theta) - l_n(\theta_0) = l_n(\hat\theta_n) - l_n(\theta_0) - \frac{n}{2}\,(\theta - \hat\theta_n)'\,I(\theta_0)\,(\theta - \hat\theta_n)\,(1 + o_p(1)),
\]
so that, if $M_n := M\sqrt{(\log n)/n}$, with $M > 0$, on a set of $P_{\theta_0}^{(n)}$-probability going to 1,
\[
\int_{\|\theta - \hat\theta_n\| > M_n} e^{l_n(\theta) - l_n(\theta_0)}\,d\Pi(\theta\mid\hat\lambda_n) = O(n^{-H}) \quad \text{for all } H > 0,
\]

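provided $M$ is large enough. This leads to
\[
\frac{\hat m(X_{1:n})}{p_{\theta_0}^{(n)}(X_{1:n})} = e^{l_n(\hat\theta_n) - l_n(\theta_0)} \int_{U_{M_n}} e^{-n(\theta - \hat\theta_n)' I(\theta_0)(\theta - \hat\theta_n)/2}\,d\Pi(\theta\mid\hat\lambda_n) + O(n^{-H}),
\]
where $U_{M_n} := \{\theta : \|\theta - \hat\theta_n\| \le M_n\}$. With abuse of notation, we still denote by $A_n$ the set having $P_{\theta_0}^{(n)}$-probability going to 1 wherein the above computations are valid, so that, on $A_n \cap \Omega_{n,\delta}$,
\[
\frac{\hat m(X_{1:n})}{p_{\theta_0}^{(n)}(X_{1:n})} \le 1 + 2\delta \quad \text{for } n \text{ large enough.}
\]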

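Let $\lambda \in \Lambda$ be such that the prior density $\pi(\cdot\mid\lambda)$ is positive and continuous at $\theta_0$. Under assumptions (iii) and (A1), the usual Laplace expansion of the marginal distribution of $X_{1:n}$ yields
\[
\frac{m(X_{1:n}\mid\lambda)}{p_{\theta_0}^{(n)}(X_{1:n})} = \frac{\pi(\theta_0\mid\lambda)\,e^{l_n(\hat\theta_n) - l_n(\theta_0)}\,(2\pi)^{k/2}}{n^{k/2}\,|I(\theta_0)|^{1/2}}\,(1 + o_p(1)),
\]
so that $m(X_{1:n}\mid\lambda) / \hat m(X_{1:n}) = o_p(1)$. We now study the $L_1$-distance between the two posteriors.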
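The Laplace step above can be checked numerically in a conjugate toy case. The sketch below assumes, purely for illustration, $X_i \mid \theta \sim N(\theta, 1)$ with a $N(0, \tau^2)$ prior, $\theta_0 = 0$, $k = 1$ and $I(\theta_0) = 1$; none of these choices comes from the text, and the function names are hypothetical. In this case the marginal ratio is available in closed form, so it can be compared with the Laplace approximation when $\bar x$ is of order $n^{-1/2}$.

```python
import math


def exact_ratio(n, xbar, tau2):
    """Exact m(X | lam) / p_{theta_0}(X) for the assumed toy model:
    the ratio of the N(0, tau2 + 1/n) and N(0, 1/n) densities at xbar."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return math.exp(0.5 * n * xbar ** 2 * shrink) / math.sqrt(1.0 + n * tau2)


def laplace_ratio(n, xbar, tau2):
    """Laplace approximation with k = 1 and I(theta_0) = 1:
    pi(theta_0 | lam) * exp(l_n(theta_hat) - l_n(theta_0)) * sqrt(2 pi / n)."""
    prior_at_theta0 = 1.0 / math.sqrt(2.0 * math.pi * tau2)
    loglik_gain = 0.5 * n * xbar ** 2  # l_n(xbar) - l_n(0) in this model
    return prior_at_theta0 * math.exp(loglik_gain) * math.sqrt(2.0 * math.pi / n)


def relative_error(n, z=1.0, tau2=1.0):
    """Relative error of the approximation when xbar = z / sqrt(n)."""
    xbar = z / math.sqrt(n)
    return abs(exact_ratio(n, xbar, tau2) / laplace_ratio(n, xbar, tau2) - 1.0)
```

In this toy case both quantities decay like $n^{-1/2}$, which is consistent with $m(X_{1:n}\mid\lambda)/\hat m(X_{1:n}) = o_p(1)$ once $\hat m(X_{1:n}) / p_{\theta_0}^{(n)}(X_{1:n}) \ge 1$.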
If $\Pi(\cdot\mid\hat\lambda_n)$ is degenerate (i.e., it is not absolutely continuous w.r.t. Lebesgue measure, which plays here the role of $\nu$), then the $L_1$-distance between the EB posterior and the posterior corresponding to $\Pi(\cdot\mid\lambda)$ is 1. Thus, we only need to consider the case where $\Pi(\cdot\mid\hat\lambda_n)$ is absolutely continuous w.r.t. Lebesgue measure. On a set of $P_{\theta_0}^{(n)}$-probability going to 1, which we still denote by $A_n$, intersected with $\Omega_{n,\delta}$, for each $\theta \in U_{M_n}$,



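\[
\pi(\theta\mid\hat\lambda_n, X_{1:n}) - \pi(\theta\mid\lambda, X_{1:n})
= e^{l_n(\theta) - l_n(\hat\theta_n)}\,\frac{n^{k/2}\,|I(\theta_0)|^{1/2}}{(2\pi)^{k/2}}
\left[\frac{p_{\theta_0}^{(n)}(X_{1:n})}{\hat m(X_{1:n})}\,e^{l_n(\hat\theta_n) - l_n(\theta_0)}\,\pi(\theta\mid\hat\lambda_n)\,\frac{(2\pi)^{k/2}}{n^{k/2}\,|I(\theta_0)|^{1/2}} - 1 + o_p(1)\right],
\]
where, by (iii), $e^{l_n(\theta) - l_n(\hat\theta_n)} = e^{-n(\theta - \hat\theta_n)' I(\theta_0)(\theta - \hat\theta_n)(1 + o_p(1))/2}$.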

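Set $u := \sqrt{n}\,I(\theta_0)^{1/2}(\theta - \hat\theta_n)$ and define $V_n := \{u : g_n(u) > 1 - 2\delta\}$, where
\[
g_n(u) := \pi\bigl(\hat\theta_n + I(\theta_0)^{-1/2} u/\sqrt{n} \,\big|\, \hat\lambda_n\bigr)\,\frac{(2\pi)^{k/2}}{n^{k/2}\,|I(\theta_0)|^{1/2}}.
\]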
To simplify the notation, we also denote by $V_n := \{\theta = \hat\theta_n + I(\theta_0)^{-1/2} u/\sqrt{n} : u \in V_n\}$. Then, for all $c > 0$,
\[
\int_{V_n \cap \{\|u\| \le c M_n \sqrt{n}\}} g_n(u)\,du = (2\pi)^{k/2} \int_{V_n \cap \{\|I(\theta_0)^{1/2}(\theta - \hat\theta_n)\| \le c M_n\}} \pi(\theta\mid\hat\lambda_n)\,d\theta \le (2\pi)^{k/2}
\]

and, by definition of $V_n$,
\[
\int_{V_n \cap \{\|u\| \le c M_n \sqrt{n}\}} g_n(u)\,du > (1 - 2\delta) \int_{V_n \cap \{\|u\| \le c M_n \sqrt{n}\}} du.
\]
Hence
\[
\int_{V_n \cap \{\|u\| \le c M_n \sqrt{n}\}} du \le (2\pi)^{k/2}\,(1 - 2\delta)^{-1}. \tag{5.5}
\]



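Note that, on $V_n^c$,
\[
\frac{(2\pi)^{k/2}}{n^{k/2}\,|I(\theta_0)|^{1/2}}\,\pi(\theta\mid\hat\lambda_n) \le 1 - 2\delta,
\]
so that
\[
\pi(\theta\mid\hat\lambda_n)\,(1 + \delta)\,\frac{(2\pi)^{k/2}}{n^{k/2}\,|I(\theta_0)|^{1/2}} - 1 \le -\delta,
\]
and we can bound from below the $L_1$-distance between the two posteriors: on $A_n \cap \Omega_{n,\delta}$,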



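\[
\begin{aligned}
\int_\Theta \bigl|\pi(\theta\mid\hat\lambda_n, X_{1:n}) - \pi(\theta\mid\lambda, X_{1:n})\bigr|\,d\theta
&\ge \int_{V_n^c \cap U_{M_n}} \bigl|\pi(\theta\mid\hat\lambda_n, X_{1:n}) - \pi(\theta\mid\lambda, X_{1:n})\bigr|\,d\theta \\
&\gtrsim \delta \int_{V_n^c \cap U_{M_n}} e^{-n(\theta - \hat\theta_n)' I(\theta_0)(\theta - \hat\theta_n)/2}\,\frac{n^{k/2}\,|I(\theta_0)|^{1/2}}{(2\pi)^{k/2}}\,d\theta \\
&\ge \delta \int_{V_n^c \cap \{\|u\| \le c M \sqrt{\log n}\}} \phi(u)\,du,
\end{aligned}
\]
for some $c > 0$, since $I(\theta_0)$ is positive definite and where $\phi(\cdot)$ is the density of a standard Gaussian distribution on $\mathbb{R}^k$. By choosing $L > 0$ large enough and using (5.5),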



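\[
\begin{aligned}
\int_{V_n^c \cap \{\|u\| \le c M \sqrt{\log n}\}} \phi(u)\,du
&\ge \int_{V_n^c \cap \{\|u\| \le L\}} \phi(u)\,du \\
&\ge \phi(L) \int_{V_n^c \cap \{\|u\| \le L\}} du \\
&= \phi(L) \left( \frac{\pi^{k/2} L^k}{\Gamma(k/2 + 1)} - \int_{V_n \cap \{\|u\| \le L\}} du \right) \\
&\ge \phi(L)\,\frac{\pi^{k/2} L^k}{2\,\Gamma(k/2 + 1)} > 0,
\end{aligned}
\]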

which completes the proof. □ 
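The last step of the proof relies on the volume of a $k$-dimensional ball of radius $L$, $\pi^{k/2} L^k / \Gamma(k/2 + 1)$, eventually dominating the constant that caps the volume of $V_n$. A minimal numerical sketch of that choice of $L$ follows; it assumes the cap is $(2\pi)^{k/2}/(1 - 2\delta)$, as read off the derivation of (5.5), and the function names are hypothetical.

```python
import math


def ball_volume(k, radius):
    """Lebesgue volume of the k-dimensional Euclidean ball of given radius."""
    return math.pi ** (k / 2) * radius ** k / math.gamma(k / 2 + 1)


def volume_cap(k, delta):
    """Assumed cap on the volume of V_n: (2 pi)^{k/2} / (1 - 2 delta)."""
    return (2 * math.pi) ** (k / 2) / (1 - 2 * delta)


def smallest_radius(k, delta):
    """Smallest L such that ball_volume(k, L) is at least twice the cap."""
    target = 2 * volume_cap(k, delta)
    return (target * math.gamma(k / 2 + 1) / math.pi ** (k / 2)) ** (1 / k)


def lower_bound(k, delta):
    """phi(L) * ball_volume(k, L) / 2, with phi the standard Gaussian
    density on R^k evaluated at a point of norm L."""
    L = smallest_radius(k, delta)
    phi_L = math.exp(-0.5 * L * L) / (2 * math.pi) ** (k / 2)
    return phi_L * ball_volume(k, L) / 2
```

With such an $L$, the part of the ball lying outside $V_n$ has volume at least half that of the ball, so the resulting bound is strictly positive and does not depend on $n$.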